Accelerate Python workloads with Ray
Georgetown University
Fall 2023
Credit limit - $100
Course numbers:
STAY WITH COURSE 54380 UNLESS YOU HAVE RUN OUT OF CREDITS OR >$90 USED!
Note that you will have to repeat several setup steps:
Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads, from reinforcement learning and deep learning to tuning and model serving. Learn more about Ray's rich set of libraries and integrations.
To scale any Python workload from laptop to cloud
Remember the Spark ecosystem with all its integrations…
Hadoop (fault tolerance and resiliency using commodity hardware) -> Spark (in-memory data processing) -> Ray (async processing and everything you need for scaling AI workloads)
Ray Core
Ray Core provides a small number of core primitives (i.e., tasks, actors, objects) for building and scaling distributed applications.
Ray Datasets
Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as maps (map_batches), global and grouped aggregations (GroupedDataset), and shuffling operations (random_shuffle, sort, repartition), and are compatible with a variety of file formats, data sources, and distributed frameworks.
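As a quick illustration of the transformations named above, here is a minimal sketch; the toy dataset and its column names ("group", "value") are made up for this example, not taken from the course material.

import ray

# A small in-memory dataset; each dict becomes a row with columns "group" and "value".
ds = ray.data.from_items([{"group": i % 2, "value": i} for i in range(8)])

# Grouped aggregation: count rows per group.
counts = ds.groupby("group").count()

# Shuffling operations.
shuffled = ds.random_shuffle()
sorted_ds = ds.sort("value")
repartitioned = ds.repartition(4)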
Ray enables arbitrary functions to be executed asynchronously on separate Python workers.
Such functions are called Ray remote functions and their asynchronous invocations are called Ray tasks.
import time
import ray

# Start Ray (or connect to an existing cluster).
ray.init()

# By adding the `@ray.remote` decorator, a regular Python function
# becomes a Ray remote function.
@ray.remote
def my_function():
    # Do something (simulated here with a sleep).
    time.sleep(10)
    return 1

# To invoke this remote function, use the `remote` method.
# This immediately returns an object ref (a future) and creates
# a task that will be executed on a worker process.
obj_ref = my_function.remote()

# The result can be retrieved with `ray.get`.
assert ray.get(obj_ref) == 1

# Specify required resources.
@ray.remote(num_cpus=4, num_gpus=2)
def my_other_function():
    return 1

# Ray tasks are executed in parallel.
# All computation is performed in the background, driven by Ray's internal event loop.
for _ in range(4):
    # This doesn't block.
    my_function.remote()
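The four calls in the loop above are fire-and-forget. To actually collect their results, a common pattern is to keep the object refs and pass the list to ray.get, or use ray.wait to handle results as they become ready. A short sketch, assuming the same my_function as above:

# Keep the object refs so the results can be retrieved later.
refs = [my_function.remote() for _ in range(4)]

# ray.get on a list blocks until all four tasks finish and returns their results.
assert ray.get(refs) == [1, 1, 1, 1]

# ray.wait splits the refs into those that are ready and those still pending.
ready_refs, pending_refs = ray.wait(refs, num_returns=2)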
# An actor is a stateful worker: decorating a class with `@ray.remote`
# turns it into an actor class whose methods run on a dedicated worker process.
@ray.remote(num_cpus=2, num_gpus=0.5)
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def get_counter(self):
        return self.value

# Create an actor from this class.
counter = Counter.remote()

# Call the actor.
obj_ref = counter.increment.remote()
assert ray.get(obj_ref) == 1
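Continuing the Counter example with a short usage sketch (not part of the original slide): method calls on a single actor are executed serially on its worker process, so state updates are applied in order.

# The counter already holds 1 from the call above; three more increments
# return 2, 3, and 4 in order.
refs = [counter.increment.remote() for _ in range(3)]
assert ray.get(refs) == [2, 3, 4]

# Read the actor's current state through another method call.
assert ray.get(counter.get_counter.remote()) == 4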
Tasks and actors create and compute on objects.
Objects are referred to as remote objects because they can be stored anywhere in a Ray cluster.
Remote objects are cached in Ray's distributed shared-memory object store.
There is one object store per node in the cluster.
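A minimal sketch of how tasks and objects interact (the list and the total task below are illustrative, not from the course material): ray.put stores an object in the local object store and returns an object ref, and passing that ref to a task lets Ray resolve it to the stored value on whichever node runs the task.

import ray

# Put a Python object into the distributed object store; this returns an object ref.
numbers_ref = ray.put([1, 2, 3, 4])

@ray.remote
def total(values):
    # Object refs passed as task arguments are automatically resolved
    # to the underlying objects before the task runs.
    return sum(values)

assert ray.get(total.remote(numbers_ref)) == 10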
Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as maps, global and grouped aggregations, and shuffling operations, and are compatible with a variety of file formats, data sources, and distributed frameworks.
Ray Datasets are designed to load and preprocess data for distributed ML training pipelines.
Datasets simplify general-purpose parallel GPU and CPU compute in Ray.
import ray
import pandas as pd

dataset = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")
dataset.show(limit=1)

# Find rows with sepal length < 5.5 and petal length > 3.5.
def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
    return df[(df["sepal length (cm)"] < 5.5) & (df["petal length (cm)"] > 3.5)]

# Apply the filter in parallel, one pandas batch at a time.
transformed_dataset = dataset.map_batches(transform_batch, batch_format="pandas")
print(transformed_dataset)
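To tie this back to the training-pipeline use case mentioned above, a hedged sketch (the batch size here is arbitrary): a Dataset can be consumed batch by batch with iter_batches, which is how preprocessed data is typically fed into a training loop.

# Iterate over the filtered rows in fixed-size pandas batches, as a training loop would.
for batch in transformed_dataset.iter_batches(batch_size=16, batch_format="pandas"):
    print(batch.shape)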
DSAN 6000 | Fall 2023 | https://gu-dsan.github.io/6000-fall-2023/