{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "(get-started-dask-dataframe)=\n", "# Getting Started with Dask DataFrame\n", "\n", "In this section, we will demonstrate how to parallelize pandas DataFrame using Dask DataFrame." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Dask DataFrame\n", "\n", "We can generate a Dask DataFrame named `ddf`, which is a time series dataset that is randomly generated. Each data sample represents one second, totaling four days (from 2024-01-01 0:00 to 2024-01-05 0:00)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameidxy
npartitions=4
2024-01-01stringint64float64float64
2024-01-02............
2024-01-03............
2024-01-04............
2024-01-05............
\n", "
\n", "
Dask Name: to_pyarrow_string, 2 graph layers
" ], "text/plain": [ "Dask DataFrame Structure:\n", " name id x y\n", "npartitions=4 \n", "2024-01-01 string int64 float64 float64\n", "2024-01-02 ... ... ... ...\n", "2024-01-03 ... ... ... ...\n", "2024-01-04 ... ... ... ...\n", "2024-01-05 ... ... ... ...\n", "Dask Name: to_pyarrow_string, 2 graph layers" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import dask\n", "\n", "ddf = dask.datasets.timeseries(start=\"2024-01-01\",\n", " end=\"2024-01-05\")\n", "ddf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All operations in pandas are executed immediately (i.e., Eager Execution). Dask, on the other hand, is executed lazily, and the above data has not been computed, hence represented by ellipsis (...)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the data in the Dask DataFrame (`ddf`) has not been computed yet, Dask has already retrieved the column names and data types. You can view this information using the `dtypes` attribute:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name string[pyarrow]\n", "id int64\n", "x float64\n", "y float64\n", "dtype: object" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Trigger Computation\n", "\n", "To compute and obtain results, it is necessary to manually trigger the computation using the `compute()` method." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "scroll-output" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameidxy
timestamp
2024-01-01 00:00:00Zelda988-0.8730670.157726
2024-01-01 00:00:01Oliver9780.2617350.516558
2024-01-01 00:00:02Victor9850.510027-0.259912
2024-01-01 00:00:03Yvonne999-0.4880900.463293
2024-01-01 00:00:04Hannah978-0.8793850.886313
...............
2024-01-04 23:59:55Xavier9950.650247-0.699778
2024-01-04 23:59:56Michael1027-0.8506680.867499
2024-01-04 23:59:57Ursula9690.0327130.125985
2024-01-04 23:59:58Victor999-0.4811420.920718
2024-01-04 23:59:59Edith10230.3661300.296003
\n", "

345600 rows × 4 columns

\n", "
" ], "text/plain": [ " name id x y\n", "timestamp \n", "2024-01-01 00:00:00 Zelda 988 -0.873067 0.157726\n", "2024-01-01 00:00:01 Oliver 978 0.261735 0.516558\n", "2024-01-01 00:00:02 Victor 985 0.510027 -0.259912\n", "2024-01-01 00:00:03 Yvonne 999 -0.488090 0.463293\n", "2024-01-01 00:00:04 Hannah 978 -0.879385 0.886313\n", "... ... ... ... ...\n", "2024-01-04 23:59:55 Xavier 995 0.650247 -0.699778\n", "2024-01-04 23:59:56 Michael 1027 -0.850668 0.867499\n", "2024-01-04 23:59:57 Ursula 969 0.032713 0.125985\n", "2024-01-04 23:59:58 Victor 999 -0.481142 0.920718\n", "2024-01-04 23:59:59 Edith 1023 0.366130 0.296003\n", "\n", "[345600 rows x 4 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dask DataFrame has a crucial built-in variable called `npartitions`. It signifies the number of divisions or partitions the data has been split into. As illustrated in {numref}`dask-dataframe-partition`, a Dask DataFrame comprises multiple pandas DataFrames, with each pandas DataFrame referred to as a partition." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.npartitions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{figure} ../img/ch-dask/dask-dataframe.svg\n", "---\n", "width: 400px\n", "name: dask-dataframe-partition\n", "---\n", "A Dask DataFrame comprises multiple pandas DataFrames\n", "```\n", "Each partition in a Dask DataFrame is defined by upper and lower bounds. In this example, `ddf` is partitioned based on the time column, with each day's data forming a distinct partition. The built-in variable `divisions` holds the boundary lines for each partition:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "scroll-output" ] }, "outputs": [ { "data": { "text/plain": [ "(Timestamp('2024-01-01 00:00:00'),\n", " Timestamp('2024-01-02 00:00:00'),\n", " Timestamp('2024-01-03 00:00:00'),\n", " Timestamp('2024-01-04 00:00:00'),\n", " Timestamp('2024-01-05 00:00:00'))" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf.divisions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Index\n", "\n", ":::{note}\n", "In a pandas DataFrame, there is a dedicated column for storing the index, which can be numeric, such as row numbers, or temporal. The index column is typically used solely for indexing purposes and is not considered a data field; hence, it is not visible in `ddf.dtypes`.\n", ":::\n", "\n", "In this example, the index of `ddf` is temporal, and each partition boundary is based on this index column. The entire `ddf` spans four days of data, with each partition representing a single day.\n", "\n", "Now, let's select data from 2024-01-01 0:00 to 2024-01-02 5:00, spanning two days and two partitions." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Dask DataFrame Structure:
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameidxy
npartitions=2
2024-01-01 00:00:00.000000000stringint64float64float64
2024-01-02 00:00:00.000000000............
2024-01-02 05:00:59.999999999............
\n", "
\n", "
Dask Name: loc, 3 graph layers
" ], "text/plain": [ "Dask DataFrame Structure:\n", " name id x y\n", "npartitions=2 \n", "2024-01-01 00:00:00.000000000 string int64 float64 float64\n", "2024-01-02 00:00:00.000000000 ... ... ... ...\n", "2024-01-02 05:00:59.999999999 ... ... ... ...\n", "Dask Name: loc, 3 graph layers" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"2024-01-01 0:00\": \"2024-01-02 5:00\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `compute()` to trigger the computation and obtain the results:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
nameidxy
timestamp
2024-01-01 00:00:00Zelda988-0.8730670.157726
2024-01-01 00:00:01Oliver9780.2617350.516558
2024-01-01 00:00:02Victor9850.510027-0.259912
2024-01-01 00:00:03Yvonne999-0.4880900.463293
2024-01-01 00:00:04Hannah978-0.8793850.886313
...............
2024-01-02 05:00:55Bob10290.2185200.462808
2024-01-02 05:00:56Alice10050.7228500.067173
2024-01-02 05:00:57Tim990-0.343179-0.812177
2024-01-02 05:00:58Patricia9990.995964-0.669022
2024-01-02 05:00:59Jerry1089-0.9258390.004172
\n", "

104460 rows × 4 columns

\n", "
" ], "text/plain": [ " name id x y\n", "timestamp \n", "2024-01-01 00:00:00 Zelda 988 -0.873067 0.157726\n", "2024-01-01 00:00:01 Oliver 978 0.261735 0.516558\n", "2024-01-01 00:00:02 Victor 985 0.510027 -0.259912\n", "2024-01-01 00:00:03 Yvonne 999 -0.488090 0.463293\n", "2024-01-01 00:00:04 Hannah 978 -0.879385 0.886313\n", "... ... ... ... ...\n", "2024-01-02 05:00:55 Bob 1029 0.218520 0.462808\n", "2024-01-02 05:00:56 Alice 1005 0.722850 0.067173\n", "2024-01-02 05:00:57 Tim 990 -0.343179 -0.812177\n", "2024-01-02 05:00:58 Patricia 999 0.995964 -0.669022\n", "2024-01-02 05:00:59 Jerry 1089 -0.925839 0.004172\n", "\n", "[104460 rows x 4 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf[\"2024-01-01 0:00\": \"2024-01-02 5:00\"].compute()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas Compatibility\n", "\n", "Most operations of Dask DataFrame and pandas are similar, allowing us to employ Dask DataFrame much like we would with pandas.\n", "\n", "For instance, data filtering and groupby operations are conducted in a manner analogous to pandas:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dask Series Structure:\n", "npartitions=1\n", " float64\n", " ...\n", "Name: x, dtype: float64\n", "Dask Name: sqrt, 10 graph layers" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf2 = ddf[ddf.y > 0]\n", "ddf3 = ddf2.groupby(\"name\").x.std()\n", "ddf3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results are still represented by ellipsis (...) because the computation is deferred and requires invoking `compute()` to trigger execution." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "name\n", "Alice 0.570873\n", "Bob 0.574338\n", "Charlie 0.577622\n", "Dan 0.578480\n", "Edith 0.579528\n", "Frank 0.576647\n", "George 0.576759\n", "Hannah 0.580472\n", "Ingrid 0.579585\n", "Jerry 0.577015\n", "Kevin 0.574981\n", "Laura 0.578379\n", "Michael 0.574430\n", "Norbert 0.575075\n", "Oliver 0.577198\n", "Patricia 0.579569\n", "Quinn 0.573159\n", "Ray 0.577756\n", "Sarah 0.573136\n", "Tim 0.574545\n", "Ursula 0.580282\n", "Victor 0.578231\n", "Wendy 0.576533\n", "Xavier 0.578146\n", "Yvonne 0.576635\n", "Zelda 0.570968\n", "Name: x, dtype: float64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "computed_ddf = ddf3.compute()\n", "computed_ddf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computational Graph\n", "\n", "Until now, we understand that Dask DataFrame divides large datasets into partitions and operates with a deferred execution manner. Before executing `compute()`, Dask has built a Task Graph, and you can visualize this Task Graph using `visualize()`:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "8975093438328741264\n", "\n", "getitem\n", "\n", "\n", "\n", "4154053050679792461\n", "\n", "0\n", "\n", "\n", "\n", "8975093438328741264->4154053050679792461\n", "\n", "\n", "\n", "\n", "\n", "1009892751118477009\n", "\n", "getitem\n", "\n", "\n", "\n", "4154053050679792461->1009892751118477009\n", "\n", "\n", "\n", "\n", "\n", "-1886318307358242303\n", "\n", "0\n", "\n", "\n", "\n", "-1886318307358242303->8975093438328741264\n", "\n", "\n", "\n", "\n", "\n", "2368622691986073372\n", "\n", "getitem\n", "\n", "\n", "\n", "-1886318307358242303->2368622691986073372\n", "\n", "\n", "\n", "\n", "\n", "-7049029746738975710\n", "\n", "0\n", "\n", "\n", "\n", "-7049029746738975710->8975093438328741264\n", "\n", "\n", "\n", "\n", "\n", "-4545202342721560692\n", "\n", "getitem\n", "\n", "\n", "\n", "-7785050266927760730\n", "\n", "1\n", "\n", "\n", "\n", "-4545202342721560692->-7785050266927760730\n", "\n", "\n", "\n", "\n", "\n", "-26946661153843634\n", "\n", "getitem\n", "\n", "\n", "\n", "-7785050266927760730->-26946661153843634\n", "\n", "\n", "\n", "\n", "\n", "4621322448743756122\n", "\n", "1\n", "\n", "\n", "\n", "4621322448743756122->-4545202342721560692\n", "\n", "\n", "\n", "\n", "\n", "2407315613922909122\n", "\n", "getitem\n", "\n", "\n", "\n", "4621322448743756122->2407315613922909122\n", "\n", "\n", "\n", "\n", "\n", "-541388990636977285\n", "\n", "1\n", "\n", "\n", "\n", "-541388990636977285->-4545202342721560692\n", "\n", "\n", "\n", "\n", "\n", "-4704240553267036212\n", "\n", "getitem\n", "\n", "\n", "\n", "5768619777809094520\n", "\n", "2\n", "\n", "\n", "\n", "-4704240553267036212->5768619777809094520\n", "\n", "\n", "\n", "\n", "\n", "-831073566302416676\n", "\n", "getitem\n", "\n", "\n", "\n", "5768619777809094520->-831073566302416676\n", "\n", "\n", "\n", "\n", "\n", "-271751580228940244\n", "\n", "2\n", "\n", "\n", "\n", "-271751580228940244->-4704240553267036212\n", "\n", "\n", "\n", "\n", "\n", "725387507047490957\n", "\n", "getitem\n", "\n", "\n", "\n", "-271751580228940244->725387507047490957\n", "\n", "\n", "\n", "\n", "\n", "-5434463019609673651\n", "\n", "2\n", "\n", "\n", "\n", "-5434463019609673651->-4704240553267036212\n", "\n", "\n", "\n", "\n", "\n", "-6386168660142454377\n", "\n", "getitem\n", "\n", "\n", "\n", "7877903150926724817\n", "\n", "3\n", "\n", "\n", "\n", "-6386168660142454377->7877903150926724817\n", "\n", "\n", "\n", "\n", "\n", "3408245606136475932\n", "\n", "getitem\n", "\n", "\n", "\n", "7877903150926724817->3408245606136475932\n", "\n", "\n", "\n", "\n", "\n", "1837531792888690053\n", "\n", "3\n", "\n", "\n", "\n", "1837531792888690053->-6386168660142454377\n", "\n", "\n", "\n", "\n", "\n", "566349296502015437\n", "\n", "getitem\n", "\n", "\n", "\n", "1837531792888690053->566349296502015437\n", "\n", "\n", "\n", "\n", "\n", "1073177736492324774\n", "\n", "3\n", "\n", "\n", "\n", "1073177736492324774->-6386168660142454377\n", "\n", "\n", "\n", "\n", "\n", "8198538481943568633\n", "\n", "make-timeseries\n", "\n", "\n", "\n", "4961396763331876892\n", "\n", "0\n", "\n", "\n", "\n", "8198538481943568633->4961396763331876892\n", "\n", "\n", "\n", "\n", "\n", "-4072055406750210966\n", "\n", "to_pyarrow_string\n", "\n", "\n", "\n", "4961396763331876892->-4072055406750210966\n", "\n", "\n", "\n", "\n", "\n", "2118252992083782340\n", "\n", "make-timeseries\n", "\n", "\n", "\n", "7070680136449507189\n", "\n", "1\n", "\n", "\n", "\n", "2118252992083782340->7070680136449507189\n", "\n", "\n", "\n", "\n", "\n", "-4876182311898784008\n", "\n", "to_pyarrow_string\n", "\n", "\n", "\n", "7070680136449507189->-4876182311898784008\n", "\n", "\n", "\n", "\n", "\n", "-4200322793466611051\n", "\n", "make-timeseries\n", "\n", "\n", "\n", "2177606107476810823\n", "\n", "2\n", "\n", "\n", "\n", "-4200322793466611051->2177606107476810823\n", "\n", "\n", "\n", "\n", "\n", "7251985976260374217\n", "\n", "to_pyarrow_string\n", "\n", "\n", "\n", "2177606107476810823->7251985976260374217\n", "\n", "\n", "\n", "\n", "\n", "-5004449698615184093\n", "\n", "make-timeseries\n", "\n", "\n", "\n", "8685246863578809248\n", "\n", "3\n", "\n", "\n", "\n", "-5004449698615184093->8685246863578809248\n", "\n", "\n", "\n", "\n", "\n", "1171700486400587924\n", "\n", "to_pyarrow_string\n", "\n", "\n", "\n", "8685246863578809248->1171700486400587924\n", "\n", "\n", "\n", "\n", "\n", "-4072055406750210966->-1886318307358242303\n", "\n", "\n", "\n", "\n", "\n", "-4876182311898784008->4621322448743756122\n", "\n", "\n", "\n", "\n", "\n", "7251985976260374217->-271751580228940244\n", "\n", "\n", "\n", "\n", "\n", "1171700486400587924->1837531792888690053\n", "\n", "\n", "\n", "\n", "\n", "-1721430607841532440\n", "\n", "0\n", "\n", "\n", "\n", "2368622691986073372->-1721430607841532440\n", "\n", "\n", "\n", "\n", "\n", "-7327413143243567357\n", "\n", "gt\n", "\n", "\n", "\n", "-1721430607841532440->-7327413143243567357\n", "\n", "\n", "\n", "\n", "\n", "4786210148260465985\n", "\n", "1\n", "\n", "\n", "\n", "2407315613922909122->4786210148260465985\n", "\n", "\n", "\n", "\n", "\n", "-7288720221306731607\n", "\n", "gt\n", "\n", "\n", "\n", "4786210148260465985->-7288720221306731607\n", "\n", "\n", "\n", "\n", "\n", "6895493521378096282\n", "\n", "2\n", "\n", "\n", "\n", "725387507047490957->6895493521378096282\n", "\n", "\n", "\n", "\n", "\n", "-3049401048867838999\n", "\n", "gt\n", "\n", "\n", "\n", "6895493521378096282->-3049401048867838999\n", "\n", "\n", "\n", "\n", "\n", "2002419492405399916\n", "\n", "3\n", "\n", "\n", "\n", "566349296502015437->2002419492405399916\n", "\n", "\n", "\n", "\n", "\n", "1877047243791410661\n", "\n", "gt\n", "\n", "\n", "\n", "2002419492405399916->1877047243791410661\n", "\n", "\n", "\n", "\n", "\n", "-7327413143243567357->-7049029746738975710\n", "\n", "\n", "\n", "\n", "\n", "-7288720221306731607->-541388990636977285\n", "\n", "\n", "\n", "\n", "\n", "-3049401048867838999->-5434463019609673651\n", "\n", "\n", "\n", "\n", "\n", "1877047243791410661->1073177736492324774\n", "\n", "\n", "\n", "\n", "\n", "-8365535502471265069\n", "\n", "0\n", "\n", "\n", "\n", "1009892751118477009->-8365535502471265069\n", "\n", "\n", "\n", "\n", "\n", "-7331327813854976558\n", "\n", "series-groupby-var-chunk\n", "\n", "\n", "\n", "-8365535502471265069->-7331327813854976558\n", "\n", "\n", "\n", "\n", "\n", "789777159281222053\n", "\n", "1\n", "\n", "\n", "\n", "-26946661153843634->789777159281222053\n", "\n", "\n", "\n", "\n", "\n", "-8762148228682956039\n", "\n", "series-groupby-var-chunk\n", "\n", "\n", "\n", "789777159281222053->-8762148228682956039\n", "\n", "\n", "\n", "\n", "\n", "7297417915383220478\n", "\n", "2\n", "\n", "\n", "\n", "-831073566302416676->7297417915383220478\n", "\n", "\n", "\n", "\n", "\n", "1834439651243165710\n", "\n", "series-groupby-var-chunk\n", "\n", "\n", "\n", "7297417915383220478->1834439651243165710\n", "\n", "\n", "\n", "\n", "\n", "2404343886410524112\n", "\n", "3\n", "\n", "\n", "\n", "3408245606136475932->2404343886410524112\n", "\n", "\n", "\n", "\n", "\n", "1675401440697690190\n", "\n", "series-groupby-var-chunk\n", "\n", "\n", "\n", "2404343886410524112->1675401440697690190\n", "\n", "\n", "\n", "\n", "\n", "7654874445250473085\n", "\n", "0\n", "\n", "\n", "\n", "-7331327813854976558->7654874445250473085\n", "\n", "\n", "\n", "\n", "\n", "-4313594997787244580\n", "\n", "series-groupby-var-agg\n", "\n", "\n", "\n", "7654874445250473085->-4313594997787244580\n", "\n", "\n", "\n", "\n", "\n", "2761800416277776719\n", "\n", "1\n", "\n", "\n", "\n", "-8762148228682956039->2761800416277776719\n", "\n", "\n", "\n", "\n", "\n", "2761800416277776719->-4313594997787244580\n", "\n", "\n", "\n", "\n", "\n", "-9177302901329776472\n", "\n", "2\n", "\n", "\n", "\n", "1834439651243165710->-9177302901329776472\n", "\n", "\n", "\n", "\n", "\n", "-9177302901329776472->-4313594997787244580\n", "\n", "\n", "\n", "\n", "\n", "4376367143407078778\n", "\n", "3\n", "\n", "\n", "\n", "1675401440697690190->4376367143407078778\n", "\n", "\n", "\n", "\n", "\n", "4376367143407078778->-4313594997787244580\n", "\n", "\n", "\n", "\n", "\n", "-6860747317429835543\n", "\n", "0\n", "\n", "\n", "\n", "-4313594997787244580->-6860747317429835543\n", "\n", "\n", "\n", "\n", "\n", "8499689125527699742\n", "\n", "getitem\n", "\n", "\n", "\n", "-6860747317429835543->8499689125527699742\n", "\n", "\n", "\n", "\n", "\n", "808338967462093868\n", "\n", "0\n", "\n", "\n", "\n", "8499689125527699742->808338967462093868\n", "\n", "\n", "\n", "\n", "\n", "2143486554077959639\n", "\n", "sqrt\n", "\n", "\n", "\n", "808338967462093868->2143486554077959639\n", "\n", "\n", "\n", "\n", "\n", "5661882285381253292\n", "\n", "0\n", "\n", "\n", "\n", "2143486554077959639->5661882285381253292\n", "\n", "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ddf3.visualize(filename=\"../img/ch-dask/visualize-task-graph\", format=\"svg\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the Task Graph, circles represent computations, and rectangles represent data. For Dask DataFrame, the rectangles correspond to pandas DataFrame partitions." ] } ], "metadata": { "kernelspec": { "display_name": "dispy", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 2 }