Parallelization in the AI Era
Georgetown University
Fall 2025
multiprocessing module
Term | Definition |
---|---|
Local | Your current workstation (laptop, desktop, etc.), wherever you start the terminal/console application. |
Remote | Any machine you connect to via ssh or other means. |
EC2 | Single virtual machine in the cloud where you can run computation (ephemeral) |
SageMaker | Integrated Developer Environment where you can conduct data science on single machines or distributed training |
GPU | Graphics Processing Unit - specialized hardware for parallel computation, essential for AI/ML |
TPU | Tensor Processing Unit - Google’s custom AI accelerator chips |
Ephemeral | Lasting for a short time - any machine that will get turned off, or any place where you will lose data
Persistent | Lasting for a long time - any environment where your work is NOT lost when the timer goes off |
From a data engineering for AI perspective, it’s easy to speed things up when the work splits into independent tasks: just run multiple data processing tasks at the same time.
The concept is based on the old middle/high school math problem:
If 5 people can shovel a parking lot in 6 hours, how long will it take 100 people to shovel the same parking lot?
Basic idea is that many hands (cores/instances) make lighter (faster/more efficient) work of the same problem, as long as the effort can be split up appropriately into nearly equal parcels
\[ \lim_{s\rightarrow\infty} S_{latency} = \frac{1}{1-p} \]
Data Pipeline Examples:

- If 80% is parallel (processing) and 20% is sequential (I/O), max speedup = 5x
- If 95% is parallel (embedding generation), max speedup = 20x
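For reference, this limit is the s → ∞ case of Amdahl’s law; plugging the 80%-parallel pipeline (p = 0.8) into the full formula:

\[ S_{latency}(s) = \frac{1}{(1-p) + \frac{p}{s}}, \qquad \lim_{s\rightarrow\infty} S_{latency}(s) = \frac{1}{1-0.8} = 5 \]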
Data Preparation:

- Text extraction from documents (PDFs, HTML)
- Tokenization of text corpora
- Image preprocessing and augmentation
- Embedding generation for documents
- Data quality filtering and validation
- Format conversions (audio → features)
- Web scraping and data collection
- Synthetic data generation

Data Processing:

- Batch inference on datasets
- Feature extraction at scale
- Data deduplication (local)
- Parallel chunk processing
Order-Dependent:

- Conversation threading
- Time-series preprocessing
- Sequential data validation
- Cumulative statistics

Global Operations:

- Global deduplication
- Cross-dataset joins
- Sorting entire datasets
- Computing exact quantiles

Small Data:

- Config file processing
- Metadata operations
- Single document processing
Data operations in the “No” column often require global coordination or must maintain strict ordering. However, many can be approximated with parallel algorithms (such as approximate deduplication with locality-sensitive hashing, or approximate quantiles with t-digest).
Higher efficiency
Using modern infrastructure
Scalable to larger data, more complex procedures
There are good solutions today for most of the cons, so the pros win out, and this paradigm is widely accepted and implemented.
A lot of these components are data engineering and DevOps issues
Modern infrastructure has standardized many of these components, helping data scientists implement parallel programming much more easily.
We’ll see in the lab how the multiprocessing module in Python makes parallel processing on a machine quite easy to implement.
These operations are embarrassingly parallel - each document/sample can be processed independently
This paradigm is often referred to as a map-reduce framework, or, more descriptively, the split-apply-combine paradigm
The map operation is a 1-1 operation that takes each split and processes it
The map operation keeps the same number of objects in its output that were present in its input
The operations included in a particular map can be quite complex, involving multiple steps. In fact, you can implement a pipeline of procedures within the map step to process each data object.
The main point is that the same operations will be run on each data object in the map implementation
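As a minimal sketch of this idea (the cleaning and tokenizing steps are made up for illustration), here is a small pipeline applied per object with Python’s built-in map:

docs = ["The Quick Brown Fox", "Hello, WORLD", "Parallel data processing"]

def process(doc):
    # a mini pipeline applied to each data object: clean, then tokenize
    cleaned = doc.lower().strip()
    return cleaned.split()

# map is 1-1: three documents in, three token lists out
token_lists = list(map(process, docs))
assert len(token_lists) == len(docs)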
Some examples of map operations are:
Traditional Data Analytics:

1. Extracting a standard table from online reports from multiple years
1. Extracting particular records from multiple JSON objects
1. Transforming data (as opposed to summarizing it)
1. Running a normalization script on each transcript in a GWAS dataset
1. Standardizing demographic data for each of the last 20 years against the 2000 US population
AI Data Processing (2025):

1. Text processing: Tokenizing millions of documents for LLM training
1. Embedding generation: Converting each document to a vector representation
1. Data extraction: Extracting text from millions of PDFs using OCR
1. Quality filtering: Applying toxicity/bias filters to each text sample
1. Image preprocessing: Resizing and normalizing images for vision models
1. Synthetic data: Generating training examples from prompts using LLMs
The reduce operation takes multiple objects and reduces them to a (perhaps) smaller number of objects using transformations that aren’t amenable to the map paradigm.
These transformations are often serial/linear in nature
The reduce transformation is usually the last, not-so-elegant transformation needed after most of the other transformations have been efficiently handled in a parallel fashion by map
The reduce operation requires an accumulator function that combines two objects into one. Programmatically, this can be written as a fold that applies the accumulator repeatedly (see the sketch after the example below).
The reduce operation works serially from “left” to “right”, passing each object successively through the accumulator function.
For example, if we were to add successive numbers with a function called add
…
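A minimal sketch of that computation using functools.reduce, with add defined as our own two-argument accumulator:

from functools import reduce

def add(x, y):
    # the accumulator: combines two objects into one
    return x + y

# works serially from left to right: ((((1 + 2) + 3) + 4) + 5) = 15
total = reduce(add, [1, 2, 3, 4, 5])
print(total)  # 15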
Some examples:
Traditional Analytics:

1. Finding the common elements (intersection) of a large number of sets
1. Computing a table of group-wise summaries
1. Filtering
1. Tabulating
AI Data Processing (2025):

1. Deduplication: Finding unique documents across billions of samples
1. Aggregating statistics: Computing dataset quality metrics
1. Vocabulary building: Creating token vocabularies from a text corpus
1. Index building: Merging vector embeddings into searchable indices
1. Data validation: Aggregating quality scores across partitions
Combining the map and reduce operations creates a powerful pipeline that can handle a diverse range of problems in the Big Data context
One of the issues here is how to split the data in a “good” manner so that the map-reduce framework works well.
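A common simple split is fixed-size chunks of records handed to a pool of workers; here is a minimal split-apply-combine sketch (the chunk size and toy word-count task are arbitrary):

from multiprocessing import Pool
from functools import reduce
from collections import Counter

def count_tokens(chunk):
    # apply: per-chunk word counts (the map step)
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts

if __name__ == "__main__":
    docs = ["a b a", "b c", "a c c"] * 1000
    # split: fixed-size chunks of documents
    chunk_size = 500
    chunks = [docs[i:i + chunk_size] for i in range(0, len(docs), chunk_size)]
    with Pool(processes=4) as pool:
        partials = pool.map(count_tokens, chunks)
    # combine: reduce the partial counts into one global count
    totals = reduce(lambda a, b: a + b, partials, Counter())
    print(totals.most_common(3))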
The multiprocessing module

Python has the Global Interpreter Lock (GIL), which allows only one thread to interact with Python objects at a time. So the way to parallel process in Python is multi-process parallelization, where we run multiple Python interpreters across multiple processes, each with its own private memory space and GIL.
The multiprocessing module offers two main interfaces:

Process: A forked copy of the current process; this creates a new process identifier, and the task runs as an independent child process in the operating system

Pool: Wraps the Process class into a convenient pool of workers that share a chunk of work and return an aggregated result
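A minimal sketch of both interfaces (the square function is just a stand-in task):

from multiprocessing import Process, Pool

def square(x):
    return x * x

if __name__ == "__main__":
    # Process: launch one independent child process for a single task
    # (its return value is not collected here; use a Queue or a Pool to get results back)
    p = Process(target=square, args=(7,))
    p.start()
    p.join()

    # Pool: a pool of workers splits the iterable and returns the aggregated results
    with Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, ..., 81]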
The joblib module

scikit-learn functions have implicit parallelization baked in through the n_jobs parameter. For example,
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100, random_state = 124, n_jobs=-1)
uses the joblib module to use all available processors (n_jobs=-1) to do the bootstrapping.
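You can also call joblib directly with its Parallel/delayed pattern; a minimal sketch with a placeholder function standing in for any independent piece of work:

from joblib import Parallel, delayed

def fit_one(seed):
    # placeholder for any independent task (e.g., fitting one bootstrapped model)
    return seed ** 2

# n_jobs=-1 uses all available processors, just as in the scikit-learn call above
results = Parallel(n_jobs=-1)(delayed(fit_one)(s) for s in range(100))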
from multiprocessing import Pool
from transformers import AutoTokenizer
# Parallel tokenization for LLM training
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def process_document(text):
    # Clean and tokenize text
    tokens = tokenizer(text, truncation=True, max_length=512)
    return tokens

# Process millions of documents in parallel
with Pool(processes=32) as pool:
    tokenized_docs = pool.map(process_document, documents)
Async I/O in AI Data Processing

When talking to external systems (databases, APIs, LLM services), the bottleneck is not local CPU/memory but rather the time it takes to receive a response from the external system.

The Async I/O model addresses this by letting you send multiple requests concurrently without having to wait for each response before sending the next.
AI Use Cases: parallel calls to LLM APIs (e.g., for synthetic data generation) and other requests to external databases and services.
References: asyncio — Asynchronous I/O, Async IO in Python: A Complete Walkthrough
Parallelism: multiple tasks are running in parallel, each on a different processor. This is done through the multiprocessing module.
Concurrency: multiple tasks are taking turns to run on the same processor. Another task can be scheduled while the current one is blocked on I/O.
Threading: multiple threads take turns executing tasks. One process can contain multiple threads. Similar to concurrency but within the context of a single process.
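To see the distinction in code, the same function can be dispatched to threads or to separate processes with concurrent.futures; a minimal sketch (the work function is a stand-in):

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def work(x):
    return x * x

if __name__ == "__main__":
    # Threading/concurrency: one process, threads take turns (best for I/O-bound work)
    with ThreadPoolExecutor(max_workers=4) as ex:
        thread_results = list(ex.map(work, range(8)))

    # Parallelism: separate processes, each with its own interpreter and GIL (best for CPU-bound work)
    with ProcessPoolExecutor(max_workers=4) as ex:
        process_results = list(ex.map(work, range(8)))

    print(thread_results == process_results)  # True: same results, different execution models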
asyncio

asyncio is a library to write concurrent code using the async/await syntax.
Use of the async and await keywords. You can call an async function multiple times while you await the result of a previous invocation.

await the results of multiple async tasks using gather.

The main function in such an example is called a coroutine. Multiple coroutines can be run concurrently as awaitable tasks.
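A minimal sketch of these keywords, with a toy coroutine standing in for real I/O:

import asyncio

async def fetch(i):
    # stand-in for an I/O-bound call (e.g., an API request)
    await asyncio.sleep(1)
    return i

async def main():
    # await the results of multiple async tasks using gather
    return await asyncio.gather(*[fetch(i) for i in range(5)])

results = asyncio.run(main())  # finishes in about 1 second, not 5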
asyncio in a Python program

1. Write a regular Python function that makes a call to a database, an API, or any other blocking functionality as you normally would.
1. Create a coroutine, i.e. an async wrapper function to the blocking function using async and await, with the call to the blocking function made using the asyncio.to_thread function. This enables the coroutine to execute in a separate thread.
1. Create another coroutine that makes multiple calls (in a loop or list comprehension) to the async wrapper created in the previous step and awaits completion of all of the invocations using the asyncio.gather function.
1. Call the coroutine created in the previous step from another function using the asyncio.run function.
import time
import asyncio
def my_blocking_func(i):
    # some blocking code such as an api call
    print(f"{i}, entry")
    time.sleep(1)
    print(f"{i}, exiting")
    return None

async def async_my_blocking_func(i: int):
    return await asyncio.to_thread(my_blocking_func, i)

async def async_my_blocking_func_for_multiple(n: int):
    return await asyncio.gather(*[async_my_blocking_func(i) for i in range(n)])

if __name__ == "__main__":
    # async version
    s = time.perf_counter()
    n = 20
    results = asyncio.run(async_my_blocking_func_for_multiple(n))
    elapsed_async = time.perf_counter() - s
    print(f"{__file__}, async_my_blocking_func_for_multiple finished in {elapsed_async:0.2f} seconds")
import asyncio
import aiohttp
async def call_llm_api(prompt, session):
    """Make async call to LLM API"""
    async with session.post(
        "https://api.example.com/generate",
        json={"prompt": prompt, "max_tokens": 100}
    ) as response:
        return await response.json()

async def process_prompts(prompts):
    """Process multiple prompts in parallel"""
    async with aiohttp.ClientSession() as session:
        tasks = [call_llm_api(prompt, session) for prompt in prompts]
        results = await asyncio.gather(*tasks)
        return results

# Generate synthetic data using parallel API calls
prompts = ["Generate a customer review for...",
           "Create technical documentation for...",
           "Write a dialogue about..."] * 100

results = asyncio.run(process_prompts(prompts))
Tip: This approach can speed up synthetic data generation by 10-100x compared to sequential API calls.
“Data is the new oil, but it needs refining” - especially for AI
# Example: Parallel document processing for RAG system
from multiprocessing import Pool
import pandas as pd
from sentence_transformers import SentenceTransformer
# Load the embedding model once (the model name here is an assumption)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def process_document(doc):
    # Extract text (extract_text, clean_text, chunk_text are assumed helper functions)
    text = extract_text(doc)
    # Clean and normalize
    cleaned = clean_text(text)
    # Chunk into passages
    chunks = chunk_text(cleaned, max_length=512)
    # Generate embeddings
    embeddings = model.encode(chunks)
    return {'doc_id': doc.id, 'chunks': chunks, 'embeddings': embeddings}

# Process millions of documents in parallel
with Pool(processes=32) as pool:
    results = pool.map(process_document, documents)
# Map: Extract features from each document
# Reduce: Aggregate statistics across corpus
word_counts = data.map(tokenize).reduce(aggregate_counts)
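A concrete version of that pseudocode, sketched with PySpark RDDs (the S3 path and whitespace tokenization are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Map: tokenize each line into (token, 1) pairs; Reduce: sum the counts per token
word_counts = (spark.sparkContext.textFile("s3://bucket/corpus/*.txt")
               .flatMap(lambda line: line.lower().split())
               .map(lambda token: (token, 1))
               .reduceByKey(lambda a, b: a + b))
print(word_counts.take(10))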
# Process TB-scale datasets
from pyspark.sql.functions import col

# assumes an active SparkSession named `spark` and a clean_text_udf defined as a UDF
df = spark.read.parquet("s3://bucket/training_data/")
cleaned = df.filter(col("quality_score") > 0.8) \
            .select(clean_text_udf("text").alias("cleaned_text"))
cleaned.write.parquet("s3://bucket/cleaned_data/")
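The clean_text_udf above is assumed; one way such a UDF might be defined (the cleaning logic is a placeholder):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import re

@udf(returnType=StringType())
def clean_text_udf(text):
    # placeholder cleaning: collapse whitespace and strip leading/trailing spaces
    return re.sub(r"\s+", " ", text or "").strip()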
# Distributed embedding generation
from sentence_transformers import SentenceTransformer  # swapped in for AutoModel so .encode() is available
import torch
import torch.distributed as dist

def generate_embeddings_distributed(documents, rank, world_size):
    # Each GPU processes a subset (every world_size-th document, offset by rank)
    subset = documents[rank::world_size]
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    model = model.to(f"cuda:{rank}")
    embeddings = []
    # batch_iterator is an assumed helper that yields lists of documents
    for batch in batch_iterator(subset, batch_size=256):
        with torch.no_grad():
            emb = model.encode(batch)
            embeddings.extend(emb)
    return embeddings
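One way to launch this across the GPUs of a single node is torch.multiprocessing.spawn; a sketch assuming documents is already loaded and world_size equals the number of GPUs:

import torch.multiprocessing as mp

def worker(rank, world_size, documents):
    # each spawned process embeds its own shard on its own GPU
    shard_embeddings = generate_embeddings_distributed(documents, rank, world_size)
    # persist shard_embeddings (e.g., to disk) for a downstream index build

if __name__ == "__main__":
    world_size = 4  # e.g., one process per GPU on a 4-GPU machine
    mp.spawn(worker, args=(world_size, documents), nprocs=world_size)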
# MinHash for approximate deduplication
from datasketch import MinHash, MinHashLSH
def parallel_dedup(documents):
    # parallel_map, parallel_filter, generate_minhash, and is_unique are assumed helpers
    # Step 1: Generate MinHash signatures (parallel)
    signatures = parallel_map(generate_minhash, documents)
    # Step 2: LSH for finding near-duplicates
    lsh = MinHashLSH(threshold=0.9, num_perm=128)
    for doc_id, sig in signatures:
        lsh.insert(doc_id, sig)
    # Step 3: Filter duplicates (parallel)
    unique_docs = parallel_filter(is_unique, documents, lsh)
    return unique_docs
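The generate_minhash helper above is assumed; a sketch of what it might look like with datasketch, treating each document as a (doc_id, text) pair:

from datasketch import MinHash

def generate_minhash(doc, num_perm=128):
    doc_id, text = doc
    m = MinHash(num_perm=num_perm)
    # hash each unique token into the signature
    for token in set(text.lower().split()):
        m.update(token.encode("utf8"))
    return doc_id, m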
from concurrent.futures import ThreadPoolExecutor

# Generate training data using LLMs
prompts = [
    "Generate a customer service dialogue about...",
    "Create a technical documentation for...",
    "Write a product review for..."
]

# Parallel generation across multiple API calls
# (generate_with_llm is an assumed helper that calls the LLM API for one prompt)
with ThreadPoolExecutor(max_workers=100) as executor:
    futures = []
    for prompt in prompts:
        future = executor.submit(generate_with_llm, prompt)
        futures.append(future)
    synthetic_data = [f.result() for f in futures]
Tool | Best For | Scalability | Learning Curve |
---|---|---|---|
multiprocessing | Single machine | 10s of cores | Low |
Spark | Distributed batch | 1000s of nodes | Medium |
Ray | ML workloads | 100s of nodes | Low |
Dask | Python-native | 100s of nodes | Low |
Beam | Stream + batch | 1000s of nodes | High |
import ray

# Stage 1: Parallel document processing
# (extract_and_clean, chunk_documents, generate_embeddings, build_distributed_index are assumed helpers)
processed_docs = ray.data.read_parquet("raw_docs/") \
    .map_batches(extract_and_clean) \
    .map_batches(chunk_documents)

# Stage 2: Distributed embedding generation
embeddings = processed_docs.map_batches(
    generate_embeddings,
    num_gpus=1,
    batch_size=100
)

# Stage 3: Build index (using FAISS)
index = build_distributed_index(embeddings)