Lecture 11

Accelerate Python workloads with Ray

Amit Arora, Abhijit Dasgupta, Anderson Monken, and Marck Vaisman

Georgetown University

Fall 2023

Logistics and Review

Deadlines

  • Lab 7: Spark DataFrames Due Oct 10 6pm
  • Lab 8: SparkNLP Due Oct 17 6pm
  • Assignment 6: Spark (Multi-part) Due Oct 23 11:59pm
  • Lab 9: SparkML Due Oct 24 6pm
  • Lab 10: Spark Streaming Due Oct 31 6pm
  • Project: First Milestone: Due Nov 6 11:59pm
  • Lab 11: Ray Due Nov 14 6pm NO LAB SUBMISSION

Look back and ahead

  • Spark RDDs & DataFrames, Spark UDFs, Spark NLP, SparkML, Spark Streaming (5 weeks of Spark) and the Project
  • Next week: Dask and more Project

Review of Prior Topics

AWS Academy

  • Credit limit - $100

  • Course numbers:

    • Course #1 - 54380
    • Course #2 - 54378

STAY WITH COURSE 54380 UNLESS YOU HAVE RUN OUT OF CREDITS OR >$90 USED!

Note that you will have to repeat several setup steps:

  • All remaining labs are to be done in AWS
  • SageMaker domain setup (review the Slack channel for solutions to common issues)
  • Any S3 bucket creation, uploading, or copying, as necessary

Ray

Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads, from reinforcement learning and deep learning to tuning and model serving. See the Ray documentation (https://docs.ray.io) for its full set of libraries and integrations.

Why do we need Ray?

  • To scale any Python workload from laptop to cloud

  • Remember the Spark ecosystem with all its integrations…

  • Hadoop (fault tolerance and resiliency on commodity hardware) -> Spark (in-memory data processing) -> Ray (asynchronous processing and everything needed to scale AI workloads)

What can we do with Ray?

What comes in the box?

Our focus today …

Ray Core

Ray Core provides a small number of core primitives (i.e., tasks, actors, objects) for building and scaling distributed applications.

https://docs.ray.io/en/latest/ray-core/walkthrough.html
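
Before any of these primitives can be used, a Ray runtime has to be started. A minimal sketch (not from the slides) of initializing Ray locally or attaching to an existing cluster:

import ray

# Start a local Ray instance using this machine's CPUs/GPUs.
# To attach to an existing cluster instead, use ray.init(address="auto").
ray.init()

# Show the resources (CPUs, GPUs, memory) that Ray detected.
print(ray.cluster_resources())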

Ray Datasets

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. They provide basic distributed data transformations such as maps (map_batches), global and grouped aggregations (GroupedDataset), and shuffling operations (random_shuffle, sort, repartition), and are compatible with a variety of file formats, data sources, and distributed frameworks.

https://docs.ray.io/en/latest/data/dataset.html
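
As a hedged sketch (assuming a Ray 2.x installation), the grouped aggregation and shuffling operations named above look like this on a small in-memory dataset; the column names `group` and `value` are made up for illustration:

import ray

# A tiny dataset of dicts; each row has a group key and a value.
ds = ray.data.from_items([{"group": i % 2, "value": i} for i in range(10)])

# Grouped aggregation: sum of `value` within each group.
print(ds.groupby("group").sum("value").take_all())

# Shuffling operations.
shuffled = ds.random_shuffle()                  # randomize row order
sorted_ds = ds.sort("value", descending=True)   # global sort
repartitioned = ds.repartition(4)               # change the number of blocks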

Ray Core - Tasks

  • Ray enables arbitrary functions to be executed asynchronously on separate Python workers.

  • Such functions are called Ray remote functions and their asynchronous invocations are called Ray tasks.

import time

import ray

# By adding the `@ray.remote` decorator, a regular Python function
# becomes a Ray remote function.
@ray.remote
def my_function():
    # do something
    time.sleep(10)
    return 1

# To invoke this remote function, use the `remote` method.
# This will immediately return an object ref (a future) and then create
# a task that will be executed on a worker process.
obj_ref = my_function.remote()

# The result can be retrieved with ``ray.get``.
assert ray.get(obj_ref) == 1

# Specify required resources.
@ray.remote(num_cpus=4, num_gpus=2)
def my_other_function():
    return 1

# Ray tasks are executed in parallel.
# All computation is performed in the background, driven by Ray's internal event loop.
for _ in range(4):
    # This doesn't block.
    my_function.remote()
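
The loop above launches the tasks but discards their object refs. A minimal sketch (not from the slide; it assumes at least four CPUs are free so the tasks truly overlap) of collecting the results:

# Keep the object refs so the results can be gathered later.
obj_refs = [my_function.remote() for _ in range(4)]

# ray.get blocks until every task has finished; the four 10-second
# sleeps overlap, so this takes roughly 10 seconds rather than 40.
assert ray.get(obj_refs) == [1, 1, 1, 1]

# ray.wait returns as soon as `num_returns` tasks are done, which is
# useful for processing results as they arrive.
ready, not_ready = ray.wait(obj_refs, num_returns=2)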

Ray Core - Actors

  • Actors extend the Ray API from functions (tasks) to classes.
  • An actor is a stateful worker (or a service).
  • When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker.

import ray

@ray.remote(num_cpus=2, num_gpus=0.5)
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def get_counter(self):
        return self.value

# Create an actor from this class.
counter = Counter.remote()
# Call the actor.
obj_ref = counter.increment.remote()
assert ray.get(obj_ref) == 1
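
Continuing from the Counter actor above (a hedged sketch; note that the num_gpus=0.5 request means the actor only schedules on a machine with a GPU):

# The counter already holds the value 1 after the call above.
# Method calls on the same actor execute serially, in submission order,
# so the returned values are deterministic.
refs = [counter.increment.remote() for _ in range(5)]
assert ray.get(refs) == [2, 3, 4, 5, 6]

# Each new actor gets its own worker process and its own independent state.
other_counter = Counter.remote()
assert ray.get(other_counter.get_counter.remote()) == 0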

Ray Core - Objects

  • Tasks and actors create and compute on objects.

  • Objects are referred to as remote objects because they can be stored anywhere in a Ray cluster.

    • We use object refs to refer to them.
  • Remote objects are cached in Ray’s distributed shared-memory object store

  • There is one object store per node in the cluster.

    • In the cluster setting, a remote object can live on one or many nodes, independent of who holds the object ref(s).

Note

Remote objects are immutable. That is, their values cannot be changed after creation. This allows remote objects to be replicated in multiple object stores without needing to synchronize the copies.

Ray Core - Objects (contd.)

import ray

# Put an object in Ray's object store.
y = 1
object_ref = ray.put(y)

# Get the value of one object ref.
obj_ref = ray.put(1)
assert ray.get(obj_ref) == 1

# Get the values of multiple object refs in parallel.
assert ray.get([ray.put(i) for i in range(3)]) == [0, 1, 2]
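
A minimal sketch (not from the slide) of the point above: an object ref can be passed to a task, and Ray fetches the underlying object to whichever worker runs it, so a large object put once can be shared by many tasks without re-serializing it each time.

import ray
import numpy as np

@ray.remote
def total(arr):
    # `arr` arrives as the materialized array, not the ref.
    return arr.sum()

# Put a large array in the object store once.
big_array_ref = ray.put(np.ones(1_000_000))

# Pass the ref to several tasks; each task reads the same stored object.
results = ray.get([total.remote(big_array_ref) for _ in range(3)])
assert all(r == 1_000_000 for r in results)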

Ray Datasets

  • Ray Datasets are the standard way to load and exchange data in Ray libraries and applications.

  • They provide basic distributed data transformations such as maps, global and grouped aggregations, and shuffling operations.

  • Compatible with a variety of file formats, data sources, and distributed frameworks.

  • Ray Datasets are designed to load and preprocess data for distributed ML training pipelines.

  • Datasets simplify general purpose parallel GPU and CPU compute in Ray.

    • Provide a higher-level API on top of Ray tasks and actors for such embarrassingly parallel compute, internally handling operations like batching, pipelining, and memory management (see the sketch below)
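
A hedged sketch of that higher-level API (assuming Ray 2.x, where map_batches accepts compute=ActorPoolStrategy(...)): a callable class runs as a small pool of actors, so expensive setup in __init__ (for example, loading a model) happens once per actor rather than once per batch. The class and column names here are illustrative, not from the lecture.

import pandas as pd
import ray
from ray.data import ActorPoolStrategy

class AddOffset:
    def __init__(self, offset: int):
        # Stand-in for expensive, stateful setup such as loading a model.
        self.offset = offset

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        batch["value"] = batch["value"] + self.offset
        return batch

ds = ray.data.from_items([{"value": i} for i in range(100)])

out = ds.map_batches(
    AddOffset,
    fn_constructor_args=(10,),
    batch_format="pandas",
    compute=ActorPoolStrategy(size=2),  # a pool of two actors
)
print(out.take(3))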

Ray Datasets

  • Create

import ray

dataset = ray.data.read_csv("s3://anonymous@air-example-data/iris.csv")

dataset.show(limit=1)

  • Transform

import pandas as pd

# Find rows with sepal length < 5.5 and petal length > 3.5.
# batch_format="pandas" below ensures each batch arrives as a pandas DataFrame.
def transform_batch(df: pd.DataFrame) -> pd.DataFrame:
    return df[(df["sepal length (cm)"] < 5.5) & (df["petal length (cm)"] > 3.5)]

transformed_dataset = dataset.map_batches(transform_batch, batch_format="pandas")
print(transformed_dataset)

  • Consume

batches = transformed_dataset.iter_batches(batch_size=8)
print(next(iter(batches)))

  • Save

import os

transformed_dataset.write_parquet("iris")

print(os.listdir("iris"))
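
As a quick check (not in the slide), the Parquet files written above can be read back with Ray to confirm the round trip:

restored = ray.data.read_parquet("iris")
print(restored.count())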

References

Lab