Parallelization
Georgetown University
Fall 2025
The multiprocessing module

Term | Definition |
---|---|
Local | Your current workstation (laptop, desktop, etc.), wherever you start the terminal/console application. |
Remote | Any machine you connect to via ssh or other means. |
EC2 | A single virtual machine in the cloud where you can run computation (ephemeral). |
SageMaker | An integrated development environment where you can conduct data science on single machines. |
Ephemeral | Lasting for a short time: any machine that will be turned off, or any place where you will lose data. |
Persistent | Lasting for a long time: any environment where your work is NOT lost when the timer goes off. |
From a data science perspective
It’s easy to speed things up when the calculations are independent of one another: just run multiple calculations at the same time.
The concept is based on the old middle/high school math problem:
If 5 people can shovel a parking lot in 6 hours, how long will it take 100 people to shovel the same parking lot?
The basic idea is that many hands (cores/instances) make lighter (faster/more efficient) work of the same problem, as long as the effort can be split up appropriately into nearly equal parcels.
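Working that example out, assuming the work splits perfectly and everyone shovels at the same rate: 5 people \(\times\) 6 hours = 30 person-hours of work, so 100 people need \(30/100 = 0.3\) hours, or about 18 minutes.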
But this has become easier with better software, like the multiprocessing module in Python.
\[ S_{latency}(s) = \frac{1}{(1-p) + \frac{p}{s}} \qquad\Rightarrow\qquad \lim_{s\rightarrow\infty} S_{latency}(s) = \frac{1}{1-p} \]
where \(s\) is the speedup of that part of the task (which is \(p\) proportion of the overall task) benefitting from improved resources.
If 50% of the task is embarrassingly parallel, you can get a maximum speedup of 2-fold, while if 90% is embarrassingly parallel, you can get a maximum speedup of \(1/(1-0.9) = 10\) fold.
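As a quick numerical check, here is a small sketch in plain Python (the helper name amdahl_speedup is made up for illustration):

def amdahl_speedup(p: float, s: float) -> float:
    # Amdahl's law: overall speedup when a fraction p of the task
    # is sped up s-fold and the remaining (1 - p) stays serial.
    return 1.0 / ((1.0 - p) + p / s)

for p in (0.5, 0.9):
    # With a very large s, the speedup approaches 1 / (1 - p).
    print(p, round(amdahl_speedup(p, 1e9), 2))  # -> 2.0 and 10.0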
For processes in the “No” column, each step depends on a previous step, so they cannot be parallelized. However, there are approximate numerical methods applicable to big data which are parallelizable and get you to the right answer, based on taking random subsets of the data in parallel. We’ll see some of these when we look at Spark ML.
Higher efficiency
Using modern infrastructure
Scalable to larger data, more complex procedures
There are good solutions today for most of the cons, so the pros win out, and this paradigm is widely accepted and implemented.
A lot of these components are data engineering and DevOps issues
Infrastructures have standardized many of these and have helped data scientists implement parallel programming much more easily
We’ll see in the lab how the multiprocessing module in Python makes parallel processing on a machine quite easy to implement.
This paradigm is often referred to as a map-reduce framework, or, more descriptively, the split-apply-combine paradigm
The map operation is a 1-1 operation that takes each split and processes it
The map operation keeps the same number of objects in its output that were present in its input
The operations included in a particular map can be quite complex, involving multiple steps. In fact, you can implement a pipeline of procedures within the map step to process each data object.
The main point is that the same operations will be run on each data object in the map implementation
Some examples of map operations:
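For instance, a minimal sketch using Python’s built-in map (the variable names are made up):

# A map is 1-1: one output object per input object,
# with the same operation applied to each.
words = ["parallel", "map", "reduce"]
lengths = list(map(len, words))       # [8, 3, 6]
squares = [x ** 2 for x in range(5)]  # the same idea, written as a comprehension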
The reduce operation takes multiple objects and reduces them to a (perhaps) smaller number of objects using transformations that aren’t amenable to the map paradigm.
These transformations are often serial/linear in nature
The reduce transformation is usually the last, not-so-elegant transformation needed after most of the other transformations have been efficiently handled in a parallel fashion by map
The reduce operation requires an accumulator function that combines objects. Programmatically, reduce works serially from “left” to “right”, passing each object successively through the accumulator function.
For example, if we were to add successive numbers with a function called add
…
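One way this might look, using functools.reduce and operator.add from the standard library:

from functools import reduce
from operator import add

# reduce threads an accumulator from left to right:
# add(add(add(1, 2), 3), 4) -> 10
total = reduce(add, [1, 2, 3, 4])
print(total)  # 10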
Some examples:
Combining the map and reduce operations creates a powerful pipeline that can handle a diverse range of problems in the Big Data context
One of the issues here is how to split the data in a “good” manner so that the map-reduce framework works well.
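Putting the two together in plain Python (a toy sketch; a real framework would distribute the splits across workers):

from functools import reduce
from operator import add

data = range(10)
mapped = map(lambda x: x * x, data)  # map: the same operation on every element
total = reduce(add, mapped)          # reduce: fold the mapped results together
print(total)  # 285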
The multiprocessing module
modulePython has the Global Interpretor Lock (GIL) which only allows only one thread to interact with Python objects at a time. So the way to parallel process in Python is to do multi-processor parallelization, where we run multiple Python interpretors across multiple processes, each with its own private memory space and GIL.
The multiprocessing module provides:

Class | Description |
---|---|
Process | A forked copy of the current process; this creates a new process identifier, and the task runs as an independent child process in the operating system |
Pool | Wraps Process into a convenient pool of workers that share a chunk of work and return an aggregated result |
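A minimal sketch of Pool in action (the square helper is made up for illustration):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Four worker processes split the inputs among themselves;
    # pool.map returns the combined, ordered results.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(10)))  # [0, 1, 4, ..., 81]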
The joblib module

scikit-learn functions have implicit parallelization baked in through the n_jobs parameter. For example
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100, random_state=124, n_jobs=-1)
uses the joblib module to use all available processors (n_jobs=-1) to do the bootstrapping.
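You can also use joblib directly; a small sketch with its Parallel and delayed helpers (slow_square is a made-up stand-in for an expensive function):

from joblib import Parallel, delayed

def slow_square(x):
    return x * x

# Fan the calls out across all available cores (n_jobs=-1),
# collecting results in input order.
results = Parallel(n_jobs=-1)(delayed(slow_square)(i) for i in range(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]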
Async I/O
When talking to external systems (databases, APIs) the bottleneck is not local CPU/memory but rather the time it takes to receive a response from the external system.
The Async I/O model addresses this by allowing you to send multiple requests in parallel without having to wait for a response to each one.
References: asyncio — Asynchronous I/O, Async IO in Python: A Complete Walkthrough
Parallelism: multiple tasks are running in parallel, each on a different processor. This is done through the multiprocessing module.
Concurrency: multiple tasks are taking turns to run on the same processor. Another task can be scheduled while the current one is blocked on I/O.
Threading: multiple threads take turns executing tasks. One process can contain multiple threads. Similar to concurrency but within the context of a single process.
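To make the threading case concrete, here is a small sketch using concurrent.futures from the standard library (fetch is a stand-in for real blocking I/O):

import time
from concurrent.futures import ThreadPoolExecutor

def fetch(i):
    time.sleep(1)  # stands in for blocking I/O such as an API call
    return i

# Threads take turns holding the GIL, but a thread blocked on I/O
# releases it, so five 1-second waits overlap and finish in about 1 second.
with ThreadPoolExecutor(max_workers=5) as ex:
    print(list(ex.map(fetch, range(5))))  # [0, 1, 2, 3, 4]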
asyncio

asyncio is a library to write concurrent code using the async/await syntax.
Use of the async and await keywords. You can call an async function multiple times while you await the result of a previous invocation.

await the result of multiple async tasks using gather.
The main function in the sketch below is called a coroutine. Multiple coroutines can be run concurrently as awaitable tasks.
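A minimal sketch of this pattern (say_after is a made-up coroutine; the two calls overlap, so the total runtime is about 1 second rather than 2):

import asyncio

async def say_after(delay, msg):
    await asyncio.sleep(delay)  # non-blocking: yields control to the event loop
    return msg

async def main():
    # Schedule both coroutines and await them together.
    results = await asyncio.gather(say_after(1, "first"), say_after(1, "second"))
    print(results)  # ['first', 'second']

asyncio.run(main())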
asyncio in a Python program

1. Write a regular Python function that makes a call to a database, an API, or any other blocking functionality, as you normally would.
2. Create a coroutine, i.e. an async wrapper function for the blocking function using async and await, with the call to the blocking function made using the asyncio.to_thread function. This enables the coroutine to execute the blocking call in a separate thread.
3. Create another coroutine that makes multiple calls (in a loop or list comprehension) to the async wrapper created in the previous step and awaits completion of all of the invocations using the asyncio.gather function.
4. Call the coroutine created in the previous step from another function using the asyncio.run function.
import time
import asyncio

def my_blocking_func(i):
    # some blocking code such as an api call
    print(f"{i}, entry")
    time.sleep(1)
    print(f"{i}, exiting")
    return None

async def async_my_blocking_func(i: int):
    # run the blocking function in a separate thread so it doesn't block the event loop
    return await asyncio.to_thread(my_blocking_func, i)

async def async_my_blocking_func_for_multiple(n: int):
    # fan out n concurrent invocations and wait for all of them to finish
    return await asyncio.gather(*[async_my_blocking_func(i) for i in range(n)])

if __name__ == "__main__":
    # async version
    s = time.perf_counter()
    n = 20
    results = asyncio.run(async_my_blocking_func_for_multiple(n))
    elapsed_async = time.perf_counter() - s
    print(f"{__file__}, async_my_blocking_func_for_multiple finished in {elapsed_async:0.2f} seconds")