{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(sec-dask-map_partitions)=\n", "# `map_partitions`\n", "\n", "Aside from operations that require inter-worker communication, discussed in {numref}`sec-dask-dataframe-shuffle`, there is also the simplest form of parallelism, known as \"Embarrassingly Parallel\". The term refers to computations that require no communication between workers. For instance, adding one to a field can be done by each worker independently, without any communication overhead. In Dask DataFrame, such Embarrassingly Parallel operations are performed with `map_partitions(func)`: its argument `func` is applied to each partition (a pandas DataFrame), so any pandas operation can be used inside it. As shown in {numref}`fig-dask-map-partitions`, `map_partitions(func)` applies the transformation to every underlying pandas DataFrame.\n", "\n", "```{figure} ../img/ch-dask-dataframe/map-partitions.svg\n", "---\n", "width: 600px\n", "name: fig-dask-map-partitions\n", "---\n", "map_partitions()\n", "```\n", "\n", "## Case Study: New York Taxi Data\n", "\n", "We use the New York taxi dataset for a simple preprocessing task: computing the duration of each trip. In the original dataset, `tpep_pickup_datetime` and `tpep_dropoff_datetime` are the times at which passengers are picked up and dropped off, respectively. We subtract the pickup time `tpep_pickup_datetime` from the drop-off time `tpep_dropoff_datetime`. This calculation incurs no communication between workers, making it a typical Embarrassingly Parallel workload."
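, "\n", "As a minimal sketch of the idea (the DataFrame and column names below are made up for illustration and are not part of the taxi dataset), `map_partitions()` hands each partition to the function as a plain pandas DataFrame:\n", "\n", "```python\n", "import pandas as pd\n", "import dask.dataframe as dd\n", "\n", "pdf = pd.DataFrame({\"x\": [1, 2, 3, 4]})\n", "ddf = dd.from_pandas(pdf, npartitions=2)\n", "\n", "# Runs on each partition independently; no cross-worker communication\n", "def add_one(df):\n", "    df[\"x_plus_one\"] = df[\"x\"] + 1\n", "    return df\n", "\n", "result = ddf.map_partitions(add_one).compute()\n", "```\n", "\n", "Each partition is transformed on its own, and `compute()` concatenates the resulting pandas DataFrames."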
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import sys\n", "sys.path.append(\"..\")\n", "from utils import nyc_taxi\n", "\n", "import pandas as pd\n", "import dask\n", "dask.config.set({'dataframe.query-planning': False})\n", "import dask.dataframe as dd\n", "from dask.distributed import LocalCluster, Client\n", "\n", "cluster = LocalCluster()\n", "client = Client(cluster)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "dataset_path = nyc_taxi()\n", "ddf = dd.read_parquet(dataset_path)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VendorIDtrip_durationtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeAirport_fee
0112532023-06-01 00:08:482023-06-01 00:29:411.03.401.0N140238121.93.500.56.700.01.033.602.50.00
116142023-06-01 00:15:042023-06-01 00:25:180.03.401.0N50151115.63.500.53.000.01.023.602.50.00
2111232023-06-01 00:48:242023-06-01 01:07:071.010.201.0N13897140.87.750.510.000.01.060.050.01.75
3214062023-06-01 00:54:032023-06-01 01:17:293.09.831.0N100244139.41.000.58.880.01.053.282.50.00
425142023-06-01 00:18:442023-06-01 00:27:181.01.171.0N13723419.31.000.50.720.01.015.022.50.00
\n", "
" ], "text/plain": [ " VendorID trip_duration tpep_pickup_datetime tpep_dropoff_datetime \\\n", "0 1 1253 2023-06-01 00:08:48 2023-06-01 00:29:41 \n", "1 1 614 2023-06-01 00:15:04 2023-06-01 00:25:18 \n", "2 1 1123 2023-06-01 00:48:24 2023-06-01 01:07:07 \n", "3 2 1406 2023-06-01 00:54:03 2023-06-01 01:17:29 \n", "4 2 514 2023-06-01 00:18:44 2023-06-01 00:27:18 \n", "\n", " passenger_count trip_distance RatecodeID store_and_fwd_flag \\\n", "0 1.0 3.40 1.0 N \n", "1 0.0 3.40 1.0 N \n", "2 1.0 10.20 1.0 N \n", "3 3.0 9.83 1.0 N \n", "4 1.0 1.17 1.0 N \n", "\n", " PULocationID DOLocationID payment_type fare_amount extra mta_tax \\\n", "0 140 238 1 21.9 3.50 0.5 \n", "1 50 151 1 15.6 3.50 0.5 \n", "2 138 97 1 40.8 7.75 0.5 \n", "3 100 244 1 39.4 1.00 0.5 \n", "4 137 234 1 9.3 1.00 0.5 \n", "\n", " tip_amount tolls_amount improvement_surcharge total_amount \\\n", "0 6.70 0.0 1.0 33.60 \n", "1 3.00 0.0 1.0 23.60 \n", "2 10.00 0.0 1.0 60.05 \n", "3 8.88 0.0 1.0 53.28 \n", "4 0.72 0.0 1.0 15.02 \n", "\n", " congestion_surcharge Airport_fee \n", "0 2.5 0.00 \n", "1 2.5 0.00 \n", "2 0.0 1.75 \n", "3 2.5 0.00 \n", "4 2.5 0.00 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def transform(df):\n", "    # .dt.seconds is the within-day seconds component; trips here are far shorter than one day\n", "    df[\"trip_duration\"] = (df[\"tpep_dropoff_datetime\"] - df[\"tpep_pickup_datetime\"]).dt.seconds\n", "    # Move `trip_duration` to the front\n", "    dur_column = df.pop('trip_duration')\n", "    df.insert(1, dur_column.name, dur_column)\n", "    return df\n", "\n", "ddf = ddf.map_partitions(transform)\n", "ddf.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some Dask DataFrame APIs are themselves Embarrassingly Parallel, and they are implemented with `map_partitions()`.\n", "\n", "As mentioned in {numref}`sec-dask-dataframe-indexing`, Dask DataFrame is partitioned on a specific column (the index column); if `map_partitions()` modifies this index column, you must call `clear_divisions()` or run `set_index()` 
again." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VendorIDtrip_durationtpep_pickup_datetimetpep_dropoff_datetimepassenger_counttrip_distanceRatecodeIDstore_and_fwd_flagPULocationIDDOLocationIDpayment_typefare_amountextramta_taxtip_amounttolls_amountimprovement_surchargetotal_amountcongestion_surchargeAirport_fee
npartitions=1
int32int32datetime64[us]datetime64[us]int64float64int64stringint32int32int64float64float64float64float64float64float64float64float64float64
............................................................
\n", "
\n", "
Dask Name: transform, 2 graph layers
" ], "text/plain": [ "Dask DataFrame Structure:\n", " VendorID trip_duration tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount congestion_surcharge Airport_fee\n", "npartitions=1 \n", " int32 int32 datetime64[us] datetime64[us] int64 float64 int64 string int32 int32 int64 float64 float64 float64 float64 float64 float64 float64 float64 float64\n", " ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...\n", "Dask Name: transform, 2 graph layers" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.clear_divisions()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "client.shutdown()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "dispy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 2 }