8.1. Ray Data Overview
Ray Data is a data processing framework built on top of Ray Core, primarily addressing data preparation and processing tasks related to machine learning training or inference, or the “last mile” from storage to distributed applications.
Ray Data provides a data abstraction, ray.data.Dataset, and offers common primitives for big data processing on the Dataset. It covers most stages of data processing, including:
Data loading, such as reading Parquet files.
Transformation operations on data, such as map_batches().
Grouping and aggregation operations, such as groupby().
Data shuffling, such as random_shuffle() and repartition().
Key Concepts
Ray Data is tailored for machine learning, and its design philosophy aligns closely with the machine learning workflow. It mainly encompasses:
Data loading and storage
Data transformation
Machine learning feature preprocessing
Tight integration of datasets and machine learning models
Dataset
Ray Data is built around the ray.data.Dataset object. A Dataset is a distributed data object whose basic unit is a Block. In effect, a Dataset is a distributed list of ObjectRef[Block] references that point to multiple Blocks. A Block is a data structure based on the Apache Arrow format.
Fig. 8.1 illustrates a Dataset composed of three Blocks, each containing 1,000 rows of data.
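The layout in the figure can be mimicked with a small toy model. This is only a sketch of the idea and not Ray's implementation: the Block class and the make_blocks() helper are hypothetical names introduced for illustration, and plain Python lists stand in for Arrow tables and object references.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    """Toy stand-in for a Block: a small columnar table (column name -> values)."""
    columns: Dict[str, List[int]]

    def num_rows(self) -> int:
        # All columns have the same length; a block with no columns has 0 rows.
        return len(next(iter(self.columns.values()), []))


def make_blocks(total_rows: int, rows_per_block: int) -> List[Block]:
    """Split a range of row ids into consecutive blocks of at most rows_per_block rows."""
    blocks = []
    for start in range(0, total_rows, rows_per_block):
        stop = min(start + rows_per_block, total_rows)
        blocks.append(Block({"id": list(range(start, stop))}))
    return blocks


# A "dataset" of 3,000 rows held as three blocks of 1,000 rows each.
blocks = make_blocks(total_rows=3000, rows_per_block=1000)
block_sizes = [b.num_rows() for b in blocks]
total = sum(block_sizes)
```

In the real system the list would hold references to Blocks stored across the cluster rather than the Blocks themselves.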
We can use the from_*() APIs to import data from other systems or formats into a Dataset, such as from_pandas() and from_spark(). We can use the read_*() APIs to read data from persistent file systems, such as read_parquet(), read_json(), and so on.
Data Operations and Underlying Implementation
Data Reading and Writing
As shown in Fig. 8.2, Ray Data uses Ray tasks to read and write data in parallel. The idea behind this is straightforward: each task reads a small portion of the data, yielding multiple Blocks. The parallelism during reading can be adjusted by setting the parallelism parameter.
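The read pattern described here can be sketched with the standard library. This is a toy model and not Ray's code: a thread pool stands in for Ray tasks, SOURCE is a hypothetical in-memory stand-in for rows in storage, split_ranges(), read_task() and read_in_parallel() are illustrative helpers, and each task is simplified to return exactly one block.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Hypothetical stand-in for rows that live in persistent storage.
SOURCE = list(range(10))


def split_ranges(num_rows: int, parallelism: int) -> List[Tuple[int, int]]:
    """Divide [0, num_rows) into `parallelism` contiguous, near-equal ranges."""
    base, extra = divmod(num_rows, parallelism)
    ranges, start = [], 0
    for i in range(parallelism):
        size = base + (1 if i < extra else 0)  # first `extra` ranges get one more row
        ranges.append((start, start + size))
        start += size
    return ranges


def read_task(bounds: Tuple[int, int]) -> List[int]:
    """One read task: load only its own slice of the source and return it as a block."""
    start, stop = bounds
    return SOURCE[start:stop]


def read_in_parallel(parallelism: int) -> List[List[int]]:
    """Run one read task per range; the results are the blocks, in order."""
    ranges = split_ranges(len(SOURCE), parallelism)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(read_task, ranges))


blocks = read_in_parallel(parallelism=4)
```

Raising the parallelism value produces more, smaller blocks; lowering it produces fewer, larger ones.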
Data Transformation
As illustrated in Fig. 8.3, data transformation operations use Ray tasks or Ray actors to operate on each Block. Stateless transformations are implemented with Ray tasks, while stateful transformations are implemented with Ray actors.
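The difference between the two styles can be shown in plain Python. This is a sketch of the concept only, with no Ray calls: double() plays the role of a stateless transformation that a task could run on any Block, while the hypothetical AddOffset class plays the role of a stateful transformation whose setup runs once and is then reused for every Block, as an actor would do.

```python
from typing import List

Block = List[int]

# Three small blocks standing in for the Blocks of a Dataset.
blocks: List[Block] = [[1, 2, 3], [4, 5], [6]]


def double(block: Block) -> Block:
    """Stateless transform: the output depends only on the input block."""
    return [x * 2 for x in block]


class AddOffset:
    """Stateful transform: setup happens once in __init__, then is reused per block."""

    init_calls = 0  # counts how many times the (possibly expensive) setup ran

    def __init__(self, offset: int) -> None:
        AddOffset.init_calls += 1
        self.offset = offset  # stands in for state such as a loaded model

    def __call__(self, block: Block) -> Block:
        return [x + self.offset for x in block]


# Task-like usage: a fresh, independent function call per block.
stateless_out = [double(b) for b in blocks]

# Actor-like usage: construct once, then apply the same instance to every block.
worker = AddOffset(offset=100)
stateful_out = [worker(b) for b in blocks]
```

The stateful form pays the setup cost once rather than once per Block, which matters when the state is expensive to build.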
Next, we will delve into the details of several types of data operations.