Ray Data Overview

8.1. Ray Data Overview#

Ray Data is a data processing framework built on top of Ray Core, primarily addressing data preparation and processing tasks related to machine learning training or inference, or the “last mile” from storage to distributed applications.

Ray Data provides an abstract class for data, ray.data.Dataset, offering common primitives for big data processing on the Dataset. It covers most stages of data processing, including:

Data loading, such as reading Parquet files.
Transformation operations on data, such as map_batches().
Grouping and aggregation operations, such as groupby().
Data shuffling, such as random_shuffle() and repartition().

Key Concepts#

Ray Data is tailored for machine learning, and its design philosophy aligns closely with the machine learning workflow. It mainly encompasses:

Data loading and storage
Data transformation
Machine learning feature preprocessing
Tight integration of datasets and machine learning models

Dataset#

Ray Data is based on the ray.data.Dataset object. The Dataset is a distributed data object, with the basic unit being a Block. The Dataset is a distributed ObjectRef[Block] consisting of multiple Blocks. A Block is a data structure based on the Apache Arrow format.

Fig. 8.1 is illustrative of a dataset composed of three Blocks, each containing 1,000 rows of data.

../_images/dataset-arch.svg — Fig. 8.1 Ray Dataset#

We can use from_*() APIs to import data from other systems or formats into a Dataset, such as from_pandas(), from_spark(). We can use read_*() APIs to read data from persistent file systems, such as read_parquet(), read_json(), and so on.

Data Operations and Underlying Implementation#

Data Reading and Writing#

As shown in Fig. 8.2, Ray Data utilizes Ray tasks to read and write data in parallel. The concept behind Ray tasks is straightforward—each task reads a small portion of data, yielding multiple Blocks. The parallelism during reading can be adjusted by setting the parallelism.

../_images/dataset-read.svg — Fig. 8.2 Illustration of Reading in Ray Data#

Data Transformation#

As illustrated in Fig. 8.3, data transformation operations utilize Ray tasks or Ray actors to operate on each Block. For stateless transformation operations, the underlying implementation is Ray tasks, while for stateful transformation operations, Ray actors are used.

../_images/dataset-map.svg — Fig. 8.3 Illustration of transformations in Ray Data#

Next, we will delve into the details of several types of data operations.