Lecture 2

Course overview. Introduction to big data concepts. The Cloud.

Amit Arora, Jeff Jacobs

Georgetown University

Fall 2025

Look back

Great use of Slack
Big data definition
Used the shell in Linux on a virtual machine through Codespaces

Agenda and Goals for Today

Quick tour of the cloud services that are used in the course
Extended Lab:
- Setting up AWS accounts
- Starting VMs in the cloud and connecting to them

Glossary

Term	Definition
Local	Your current workstation (laptop, desktop, etc.), wherever you start the terminal/console application.
Remote	Any machine you connect to via ssh or other means.

Working on a single machine

You are most likely using traditional data analysis tools, which are single threaded and run on a single machine.

The BIG DATA problem

Is Moore’s Law Dead?

New Hardware

Need

The demand for data processing will not be met by relying on the same technology.
The key to modern data processing is new semiconductors
- Not just squeezing more transistors per area
- Need new compute architectures that are built and optimized for specialized functions
Specialized edge hardware for Edge Computing
While many declare Moore’s Law to be broken or no longer valid, in reality it’s not the law that is broken but rather a heat problem.

What

Graphic Processing Units (GPUs)
Field Programmable Gate Arrays (FPGAs)
Data Processing Units (DPUs)
Photonic computing

So, we can’t store or process data on a single machine, what do we do?

We distribute

More CPUs, more memory, more storage!

How do we do that?

Simple, we use the cloud

Cloud computing is a big deal!

Benefits

Provides access to low-cost computing
Costs are decreasing every year
Elastic
PAAS works!
Many other benefits…

What is the claaaaaaawd (the cloud)

What is the cloud?

\kloud\ noun

the practice of storing regularly used computer data on multiple servers that can be accessed through the Internet

Using someone else’s computer(s)

NIST Definition

Service Models

Infrastructure as a Service (IaaS)

What is IaaS?

Virtualized computing resources delivered over the internet
Provider manages: Physical hardware, virtualization, networking, storage
You manage: Operating systems, applications, runtime, data, middleware

Examples & Use Cases

Amazon EC2, Microsoft Azure VMs, Google Compute Engine
Perfect for: Development environments, web hosting, backup & recovery
Benefits: Rapid scaling, pay-as-you-go, global availability

# Launch a virtual machine in seconds
aws ec2 run-instances --image-id ami-12345 --instance-type t3.xlarge

Platform as a Service (PaaS)

What is PaaS?

Complete development and deployment environment in the cloud
Provider manages: Infrastructure, operating systems, runtime environments
You manage: Applications and data

Examples & Use Cases

AWS Elastic Beanstalk, Google App Engine, Microsoft Azure App Service
Perfect for: Web applications, API development, microservices
Benefits: Faster development, automatic scaling, integrated DevOps

# Deploy your app with minimal configuration
git push heroku main  # App automatically deployed and scaled

Software as a Service (SaaS)

What is SaaS?

Complete applications delivered over the internet
Provider manages: Everything (infrastructure, platform, software)
You manage: Your data and user access

Examples & Use Cases

Gmail, Salesforce, Microsoft 365, Slack, Zoom
Perfect for: Business applications, collaboration tools, CRM
Benefits: No installation, automatic updates, accessible anywhere

The SaaS Revolution

80% of companies use SaaS applications
$195 billion market size (2023)

Cloud Infrastructure and the AI Revolution

Why Cloud + AI = Perfect Match

Massive Computational Requirements
- Training large language models requires thousands of GPUs
- Cloud provides elastic access to specialized hardware
Data Storage & Processing
- AI models need petabytes of training data
- Cloud offers unlimited, globally distributed storage
Cost Efficiency
- Pay-per-use model for expensive AI hardware
- No upfront investment in GPU clusters

The Data Center Explosion

The Numbers

2024: Approximately 12,000 data centers worldwide (not millions)
Power consumption: Over 400 TWh annually (about 1–1.3% of global electricity use)
Growth rate: 8–11% per year (market CAGR), led by AI and cloud computing

What’s Inside Modern Data Centers?

Specialized AI Chips: NVIDIA H100s, Google TPUs, custom silicon
Massive Scale: Facebook (Meta) data centers operate hundreds of thousands of servers worldwide
Global Networks: Hyperscale data centers leverage sub-second (near real-time) global data synchronization using high-speed optical networks

Environmental Impact

Cloud leaders pledge carbon neutrality by 2030 (Google, Microsoft) or 2025 (AWS)
Rapid innovation in cooling (liquid, AI optimization), renewable energy adoption, and efficient chip design is helping curb environmental impact

References

ABI Research data center statistics: https://www.abiresearch.com/blog/data-centers-by-region-size-company
Data center count and growth: https://brightlio.com/data-center-stats/
Data centers by country: https://www.statista.com/statistics/1228433/data-centers-worldwide-by-country/
Number of colocation and hyperscale sites: https://www.linkedin.com/pulse/how-many-data-centers-world-where-built-heidi-fan-fkjxc
Data center power consumption: https://www.datacenterdynamics.com/en/news/iea-data-center-energy-consumption-set-to-double-by-2030-to-945twh/
Data center projections and market growth: https://finance.yahoo.com/news/data-center-market-size-hits-103000410.html
AI hardware in data centers: https://www.ibm.com/think/topics/ai-data-center
Facebook server estimates: https://www.datacenterknowledge.com/hyperscalers/the-facebook-data-center-faq, https://www.datacenterfrontier.com/hyperscale/article/11427952/facebook-showcases-its-40-million-square-feet-of-global-data-centers
Data center network technology: https://www.datacenterdynamics.com/en/opinions/data-center-bandwidth-continues-to-climb-and-its-not-coming-back-down-any-time-soon/
Cloud provider carbon neutrality: https://www.techmahindra.com/insights/views/sustainability-in-cloud/, https://bioscore.com/blog/green-cloud-computing-a-comprehensive-comparison-of-cloud-providers-sustainability-efforts
Data center cooling innovations: https://www.utilitydive.com/news/2025-outlook-data-center-cooling-electricity-demand-ai-dual-phase-direct-to-chip-energy-efficiency/738120/, https://blog.equinix.com/blog/2024/08/27/how-liquid-cooling-enables-the-next-generation-of-tech-innovation/
Comprehensive energy estimates: https://www.devsustainability.com/p/data-center-energy-and-ai-in-2025
Data center growth drivers and energy: https://www.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2025/genai-power-consumption-creates-need-for-more-sustainable-data-centers.html

GPU Infrastructure: The AI Powerhouse

From Gaming to AI

Originally: Graphics processing for video games
Now: Parallel processing powerhouse for AI/ML workloads
Architecture: Thousands of cores vs. CPU’s 8-64 cores

Cloud GPU Offerings

# AWS GPU instances for different AI workloads
p4d.24xlarge   # 8x NVIDIA A100 (40GB) - $32/hour
p3.16xlarge    # 8x NVIDIA V100 (16GB) - $24/hour
g4dn.xlarge    # 1x NVIDIA T4 (16GB)   - $0.50/hour

The Scale Challenge

GPT-4 training: Estimated 25,000 A100 GPUs for 6 months
Meta’s AI infrastructure: 350,000+ NVIDIA H100 chips
Cost: Single AI training run can cost $100M+

Distributed Computing and Large Clusters

Modern AI Requires Massive Clusters

Model parallelism: Split models across multiple GPUs
Data parallelism: Process different data batches simultaneously
Pipeline parallelism: Different layers on different GPUs

Technologies Enabling Scale

Kubernetes: Container orchestration for microservices
Apache Spark: Distributed data processing framework
Ray: Distributed AI/ML framework for Python

# Distributed training with Ray
import ray
from ray import tune

# Scale across 1000+ machines automatically
tune.run(train_model, resources_per_trial={"gpu": 8})

Data Storage Evolution

The Data Tsunami

2025 projection: 181 zettabytes of data globally
AI training datasets: CommonCrawl (800TB), ImageNet (150GB)
Real-time requirements: 1-10ms latency for inference and in some cases Sub-millisecond

Storage Technologies

Object Storage: Amazon S3, Google Cloud Storage (petabyte scale)
High-performance: NVMe SSDs, persistent memory
Distributed filesystems: HDFS, GlusterFS, Ceph

The Economics of Data

Storage costs: Dropped 99.9% since 1980
Transfer costs: Still significant for large datasets
Edge computing: Bringing compute closer to data sources

The evolution of the Cloud

Yesterday	Today	Tomorrow
Limited number of tools and vendors	Many tools and vendors to work with	Integrated tools and vendors
One platform - few devices	Multiple platforms - many devices	Connected platforms and devices
Data is scarce but manageable	Overabundance of data	Data is used for important business decisions
IT has major influence and control	IT has limited influence and control	IT is strategic to the business
People only work when they are at work	People work wherever they want	People have access to what they need, wherever they are

What does the cloud look like?

Virtual Visit to a Microsoft Azure Data Center

Microsoft Azure Data Center in Boydton, VA

Loudon County, VA is called “CLoudon”

How data centers power VA’s Loudon County: https://gcn.com/articles/2018/10/12/loudoun-county-data-centers.aspx
The heart of “The Cloud” is in Virginia: https://www.cbsnews.com/news/cloud-computing-loudoun-county-virginia/
CBS Sunday Morning Visits the Home of the Internet in Loudoun County: https://biz.loudoun.gov/2017/10/30/cbs-sunday-morning-visits-loudoun/

70% of the world’s internet traffic passes through Loudon County, VA

References - SaaS & Cloud Computing

SaaS and Cloud Computing Statistics:

References - Data Storage & AI Infrastructure

Global Data Volume & AI Datasets:

2025 global data volume (181 zettabytes): https://rivery.io/blog/big-data-statistics-how-much-data-is-there-in-the-world/ [2], https://explodingtopics.com/blog/data-generated-per-day [6], https://www.pit.edu/news/data-analytics-statistics-and-trends-for-2025/ [7], https://www.forbes.com/councils/forbestechcouncil/2024/03/04/from-5mb-hard-drives-to-180-zettabytes-the-data-migration-challenge/ [9]
CommonCrawl and ImageNet dataset sizes: https://www.ibm.com/think/topics/ai-data-center [10]
AI inference latency and storage tech: https://www.ibm.com/think/topics/ai-data-center [10], https://www.rcrwireless.com/20250407/fundamentals/ai-optimized-data-center [11]

Storage Economics & Technology:

Storage cost declines: https://humanprogress.org/trends/vastly-cheaper-computation/ [1], https://ourworldindata.org/data-insights/the-price-of-computer-storage-has-fallen-exponentially-since-the-1950s [8]
Ongoing significance of transfer (egress) costs: https://lightedge.com/the-data-explosion-and-hidden-data-storage-costs-in-the-cloud-could-object-storage-be-the-answer/ [3]
Edge computing trends: https://mondo.com/insights/evolution-data-storage-tapes-cloud-computing/ [5]

GPU Infrastructure & AI Training Costs:

GPT-4 GPU count and training duration: https://www.genspark.ai/spark/how-many-gpus-are-needed-to-train-gpt4-o/56c3b449-f206-495f-bc9b-d06812bf4a33 [1], https://klu.ai/blog/gpt-4-llm [3], https://www.acorn.io/resources/learning-center/openai/ [5]
Meta’s NVIDIA H100 GPUs: https://www.reddit.com/r/singularity/comments/1bi8rme/jensen_huang_just_gave_us_some_numbers_for_the/ [4]
AI training cost estimates: https://www.genspark.ai/spark/how-many-gpus-are-needed-to-train-gpt4-o/56c3b449-f206-495f-bc9b-d06812bf4a33 [1], https://www.acorn.io/resources/learning-center/openai/ [5]

Lecture 2

Look back

Agenda and Goals for Today

Glossary

Working on a single machine

The BIG DATA problem

Is Moore’s Law Dead?

New Hardware

Need

What

So, we can’t store or process data on a single machine, what do we do?

We distribute

How do we do that?

Simple, we use the cloud

Cloud computing is a big deal!

Benefits

What is the claaaaaaawd (the cloud)

What is the cloud?

\kloud\ noun

NIST Definition

Service Models

Infrastructure as a Service (IaaS)

What is IaaS?

Examples & Use Cases

Platform as a Service (PaaS)

What is PaaS?

Examples & Use Cases

Software as a Service (SaaS)

What is SaaS?

Examples & Use Cases

The SaaS Revolution

Cloud Infrastructure and the AI Revolution

Why Cloud + AI = Perfect Match

The Data Center Explosion

The Numbers

What’s Inside Modern Data Centers?

Environmental Impact

References

GPU Infrastructure: The AI Powerhouse

From Gaming to AI

Cloud GPU Offerings

The Scale Challenge

Distributed Computing and Large Clusters

Modern AI Requires Massive Clusters

Technologies Enabling Scale

Data Storage Evolution

The Data Tsunami

Storage Technologies

The Economics of Data

The evolution of the Cloud

What does the cloud look like?

Virtual Visit to a Microsoft Azure Data Center

Microsoft Azure Data Center in Boydton, VA

Loudon County, VA is called “CLoudon”

70% of the world’s internet traffic passes through Loudon County, VA

References - SaaS & Cloud Computing

References - Data Storage & AI Infrastructure

Time for Lab!