{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(get-started-dask-dataframe)=\n",
"# Getting Started with Dask DataFrame\n",
"\n",
"In this section, we will demonstrate how to parallelize pandas DataFrame using Dask DataFrame."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Dask DataFrame\n",
"\n",
"We can generate a Dask DataFrame named `ddf`, which is a time series dataset that is randomly generated. Each data sample represents one second, totaling four days (from 2024-01-01 0:00 to 2024-01-05 0:00)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
Dask DataFrame Structure:
\n",
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
name
\n",
"
id
\n",
"
x
\n",
"
y
\n",
"
\n",
"
\n",
"
npartitions=4
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2024-01-01
\n",
"
string
\n",
"
int64
\n",
"
float64
\n",
"
float64
\n",
"
\n",
"
\n",
"
2024-01-02
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2024-01-03
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2024-01-04
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2024-01-05
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Dask Name: to_pyarrow_string, 2 graph layers
"
],
"text/plain": [
"Dask DataFrame Structure:\n",
" name id x y\n",
"npartitions=4 \n",
"2024-01-01 string int64 float64 float64\n",
"2024-01-02 ... ... ... ...\n",
"2024-01-03 ... ... ... ...\n",
"2024-01-04 ... ... ... ...\n",
"2024-01-05 ... ... ... ...\n",
"Dask Name: to_pyarrow_string, 2 graph layers"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import dask\n",
"\n",
"ddf = dask.datasets.timeseries(start=\"2024-01-01\",\n",
" end=\"2024-01-05\")\n",
"ddf"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All operations in pandas are executed immediately (i.e., Eager Execution). Dask, on the other hand, is executed lazily, and the above data has not been computed, hence represented by ellipsis (...)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While the data in the Dask DataFrame (`ddf`) has not been computed yet, Dask has already retrieved the column names and data types. You can view this information using the `dtypes` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"name string[pyarrow]\n",
"id int64\n",
"x float64\n",
"y float64\n",
"dtype: object"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Trigger Computation\n",
"\n",
"To compute and obtain results, it is necessary to manually trigger the computation using the `compute()` method."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"scroll-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
name
\n",
"
id
\n",
"
x
\n",
"
y
\n",
"
\n",
"
\n",
"
timestamp
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
2024-01-01 00:00:00
\n",
"
Zelda
\n",
"
988
\n",
"
-0.873067
\n",
"
0.157726
\n",
"
\n",
"
\n",
"
2024-01-01 00:00:01
\n",
"
Oliver
\n",
"
978
\n",
"
0.261735
\n",
"
0.516558
\n",
"
\n",
"
\n",
"
2024-01-01 00:00:02
\n",
"
Victor
\n",
"
985
\n",
"
0.510027
\n",
"
-0.259912
\n",
"
\n",
"
\n",
"
2024-01-01 00:00:03
\n",
"
Yvonne
\n",
"
999
\n",
"
-0.488090
\n",
"
0.463293
\n",
"
\n",
"
\n",
"
2024-01-01 00:00:04
\n",
"
Hannah
\n",
"
978
\n",
"
-0.879385
\n",
"
0.886313
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2024-01-04 23:59:55
\n",
"
Xavier
\n",
"
995
\n",
"
0.650247
\n",
"
-0.699778
\n",
"
\n",
"
\n",
"
2024-01-04 23:59:56
\n",
"
Michael
\n",
"
1027
\n",
"
-0.850668
\n",
"
0.867499
\n",
"
\n",
"
\n",
"
2024-01-04 23:59:57
\n",
"
Ursula
\n",
"
969
\n",
"
0.032713
\n",
"
0.125985
\n",
"
\n",
"
\n",
"
2024-01-04 23:59:58
\n",
"
Victor
\n",
"
999
\n",
"
-0.481142
\n",
"
0.920718
\n",
"
\n",
"
\n",
"
2024-01-04 23:59:59
\n",
"
Edith
\n",
"
1023
\n",
"
0.366130
\n",
"
0.296003
\n",
"
\n",
" \n",
"
\n",
"
345600 rows × 4 columns
\n",
"
"
],
"text/plain": [
" name id x y\n",
"timestamp \n",
"2024-01-01 00:00:00 Zelda 988 -0.873067 0.157726\n",
"2024-01-01 00:00:01 Oliver 978 0.261735 0.516558\n",
"2024-01-01 00:00:02 Victor 985 0.510027 -0.259912\n",
"2024-01-01 00:00:03 Yvonne 999 -0.488090 0.463293\n",
"2024-01-01 00:00:04 Hannah 978 -0.879385 0.886313\n",
"... ... ... ... ...\n",
"2024-01-04 23:59:55 Xavier 995 0.650247 -0.699778\n",
"2024-01-04 23:59:56 Michael 1027 -0.850668 0.867499\n",
"2024-01-04 23:59:57 Ursula 969 0.032713 0.125985\n",
"2024-01-04 23:59:58 Victor 999 -0.481142 0.920718\n",
"2024-01-04 23:59:59 Edith 1023 0.366130 0.296003\n",
"\n",
"[345600 rows x 4 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dask DataFrame has a crucial built-in variable called `npartitions`. It signifies the number of divisions or partitions the data has been split into. As illustrated in {numref}`dask-dataframe-partition`, a Dask DataFrame comprises multiple pandas DataFrames, with each pandas DataFrame referred to as a partition."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.npartitions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{figure} ../img/ch-dask/dask-dataframe.svg\n",
"---\n",
"width: 400px\n",
"name: dask-dataframe-partition\n",
"---\n",
"A Dask DataFrame comprises multiple pandas DataFrames\n",
"```\n",
"Each partition in a Dask DataFrame is defined by upper and lower bounds. In this example, `ddf` is partitioned based on the time column, with each day's data forming a distinct partition. The built-in variable `divisions` holds the boundary lines for each partition:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": [
"scroll-output"
]
},
"outputs": [
{
"data": {
"text/plain": [
"(Timestamp('2024-01-01 00:00:00'),\n",
" Timestamp('2024-01-02 00:00:00'),\n",
" Timestamp('2024-01-03 00:00:00'),\n",
" Timestamp('2024-01-04 00:00:00'),\n",
" Timestamp('2024-01-05 00:00:00'))"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ddf.divisions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Index\n",
"\n",
":::{note}\n",
"In a pandas DataFrame, there is a dedicated column for storing the index, which can be numeric, such as row numbers, or temporal. The index column is typically used solely for indexing purposes and is not considered a data field; hence, it is not visible in `ddf.dtypes`.\n",
":::\n",
"\n",
"In this example, the index of `ddf` is temporal, and each partition boundary is based on this index column. The entire `ddf` spans four days of data, with each partition representing a single day.\n",
"\n",
"Now, let's select data from 2024-01-01 0:00 to 2024-01-02 5:00, spanning two days and two partitions."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"