Course overview. Introduction to big data concepts. The Cloud.
Georgetown University
Fall 2025
Course and syllabus overview
Big Data Concepts
Data Engineering
Introduction to bash
These are also pinned on the Slack main channel.
aa1603@georgetown.edu
Fun Facts
jj1088@georgetown.edu
Fun Facts
bc928@georgetown.edu
hk932@georgetown.edu
pp755@georgetown.edu
au195@georgetown.edu
yw924@georgetown.edu
ny159@georgetown.edu
ly290@georgetown.edu
xz646@georgetown.edu
Data is everywhere! Many times, it’s just too big to work with using traditional tools. This is a hands-on, practical, workshop-style course about using cloud computing resources to analyze and manipulate datasets that are too large to fit on a single machine and/or be analyzed with traditional tools. The course will focus on Spark, MapReduce, the Hadoop ecosystem, and other tools.
You will understand how to acquire and/or ingest the data, and then massage, clean, transform, analyze, and model it within the context of big data analytics. You will be able to think more programmatically and logically about your big data needs, tools and issues.
Always refer to the syllabus and calendar in the course website for class policies.
dsan-Fall-2025@georgetown.edu
Every 60 seconds in 2025:

* ChatGPT serves millions of requests (exact numbers proprietary)
* 500 hours of video uploaded to YouTube
* 1.04 million Slack messages sent
* 362,000 hours watched on Netflix
* 5.9-11.4 million Google searches
* $443,000 spent on Amazon
* AI-generated images created at massive scale (metrics not publicly available)
* 347,200 posts on X (formerly Twitter)
* 231-250 million emails sent
We can record every:

* click
* ad impression
* billing event
* video interaction
* server request
* transaction
* network message
* fault
* …
Many interesting datasets have a graph structure:
Some of these are HUGE
75 billion connected devices generating data:

* Smart home devices (Alexa, Google Home, Apple HomePod)
* Wearables (Apple Watch, Fitbit, Oura rings)
* Connected vehicles & autonomous driving systems
* Industrial IoT sensors
* Smart city infrastructure
* Medical devices & remote patient monitoring
The Internet
Transactions
Databases
Excel
PDF Files
Anything digital (music, movies, apps)
Some old floppy disk lying around the house
You have a laptop with 16GB of RAM and a 256GB SSD. You are given a 1TB dataset in text files. What do you do?
Your company wants to build a RAG system using 10TB of internal documents. You need sub-second query response times. How do you architect this?
You need to process 1 million events per second from IoT devices and provide real-time dashboards with <1 second latency. What’s your stack?
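There is no single right answer for the first scenario, but one common pattern is streaming the data instead of loading it. A minimal bash sketch, assuming line-oriented text files under a hypothetical `data/` folder with a comma-separated key in the first column:

```bash
# One pass over 1TB in bounded memory: only per-key counts are held in RAM
# (memory grows with the number of distinct keys, not with the data size).
cat data/*.txt | awk -F',' '{counts[$1]++} END {for (k in counts) print k, counts[k]}' > key_counts.txt

# GNU sort also works out-of-core: -S caps the memory buffer and -T
# tells it where to spill temporary chunks to disk before merging.
sort -S 4G -T /tmp data/*.txt > sorted.txt
```

The same chunk-and-merge idea is what distributed tools like MapReduce and Spark automate across many machines.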
Exponential data growth
“In essence, big data is a term for a collection of datasets so large and complex that it becomes difficult to process using traditional tools and applications.”
“Big data is when the size of the data itself becomes part of the problem”
“Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.”
Volume (Gigabytes -> Exabytes -> Zettabytes)
Velocity (Batch -> Streaming -> Real-time AI inference)
Variety (Structured, Semi-structured, Unstructured, Embeddings)
Can you analyze/process your data on a single machine?
Can you store it (or is it stored) on a single machine?
Can you serve it fast enough for real-time AI applications?
If any of the answers is no, then you have a big-ish data problem!
Traditional Use Cases:

* Business intelligence
* Analytics & reporting
* Historical data storage

Modern AI Use Cases:

* Training data repositories
* Vector embeddings storage
* RAG (Retrieval-Augmented Generation) context
* Fine-tuning datasets
* Evaluation & benchmark data
Raw Data → Data Lake → Processing → Vector DB → LLM Context
Key Components:

* Data Lakes (S3, Azure Data Lake): Store massive unstructured data
* Data Warehouses (Snowflake, BigQuery): Structured data for context
* Vector Databases (Pinecone, Weaviate, Qdrant): Semantic search
* Embedding Models: Convert data to vectors
* Orchestration (Airflow, Prefect): Manage the pipeline
What is MCP?

* Open protocol for connecting AI assistants to data sources
* Standardized way to expose tools and data to LLMs
* Enables “agentic” behavior - AI that can act autonomously
Data Warehouse → MCP Server → AI Agent → Action
Examples:

* AI agents querying Snowflake for real-time analytics
* Autonomous systems updating data lakes based on predictions
* Multi-agent systems coordinating through shared data contexts
Garbage In, Garbage Out - Amplified:

* Bad training data → Biased models
* Incorrect RAG data → Hallucinations
* Poor data governance → Compliance issues

Unified Platforms:

* Data lakes becoming “AI lakes”
* Integrated vector + relational databases
* One-stop shops for data + AI (Databricks, Snowflake Cortex)

Edge Computing + AI:

* Processing at the data source
* Federated learning across devices
* 5G enabling real-time edge AI

Synthetic Data:

* AI generating training data for AI
* Privacy-preserving synthetic datasets
* Infinite data generation loops
Query Engines:

* DuckDB - In-process analytical database
* Polars - Lightning-fast DataFrame library
* Spark - Distributed processing at scale

Data Warehouses & Lakes:

* Snowflake - Cloud-native data warehouse
* Athena - Serverless SQL on S3
* Iceberg - Open table format

AI/ML Integration:

* Vector databases for embeddings
* RAG implementation patterns
* Streaming with Spark Structured Streaming

Orchestration:

* Airflow for pipeline management
* Serverless with AWS Lambda
| | Small Data is usually… | On the other hand, Big Data… |
|---|---|---|
| Goals | gathered for a specific goal | may have a goal in mind when it’s first started, but things can evolve or take unexpected directions |
| Location | in one place, and often in a single computer file | can be in multiple files on multiple servers in different geographic locations |
| Structure/Contents | highly structured, like an Excel spreadsheet with rows and columns of data | can be unstructured, come in many formats across disciplines, and may link to other resources |
| Preparation | prepared by the end user for their own purposes | is often prepared by one group of people, analyzed by a second group, and used by a third, each possibly with different purposes and from different disciplines |
| Longevity | kept for a specific amount of time after the project is over because there’s a clear ending point. In the academic world it’s maybe five or seven years and then you can throw it away | contains data that must be stored in perpetuity. Many big data projects extend into the past and future |
| Measurements | measured with a single protocol using set units, usually at the same time | is collected and measured using many sources, protocols, units, etc. |
| Reproducibility | can be reproduced in its entirety if something goes wrong in the process | replication is seldom feasible |
| Stakes | if things go wrong the costs are limited; it’s not an enormous problem | can have high costs of failure in terms of money, time and labor |
| Access | identified by a location specified in a row/column | unless it is exceptionally well designed, the organization can be inscrutable |
| Analysis | analyzed together, all at once | is ordinarily analyzed in incremental steps |
| V | Challenge |
|---|---|
| Volume | data scale |
| Value | data usefulness in decision making |
| Velocity | data processing: batch or stream |
| Viscosity | data complexity |
| Variability | data flow inconsistency |
| Volatility | data durability |
| Viability | data activeness |
| Validity | data properly understandable |
| Variety | data heterogeneity |
William Cohen (Director, Research Engineering, Google) said the following:
Working with big data is not about:
Working with big data is about understanding:
R and Python are single-threaded. We’ll talk briefly about Apache Hadoop today, but we will not cover it in this course.
Other:
Matt Turck’s Machine Learning, Artificial Intelligence & Data Landscape (MAD)
In this course, you’ll be doing a little data engineering!
username@hostname:current_directory $
What do we learn from the prompt?
`COMMAND -F --FLAG`

* `COMMAND` is the program
* Everything after that are arguments
* `F` is a single-letter flag
* `FLAG` is a long-form flag: a single word, or words connected by dashes. A space breaks things into a new argument.
  + Sometimes single-letter and long-form flags (e.g. `F` and `FLAG`) can refer to the same argument

`COMMAND -F --FILE file1`

Here we pass a text argument “file1” into the `FILE` flag.
The `-h` flag is usually used to get help. You can also run the `man` command and pass the name of the program as the argument to get the help page.
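For example, with `ls` (flags shown are from GNU coreutils; BSD/macOS versions may differ slightly):

```bash
ls -a       # -a is a single-letter flag: also list hidden files
ls --all    # --all is the long-form flag for the same thing
ls --help   # many programs print a usage summary with --help
man ls      # the full manual page, opened in a pager
```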
Let’s try basic commands:

* `date` to get the current date
* `whoami` to get your user name
* `echo "Hello World"` to print to the console
* `pwd` to find out your Present Working Directory
* `ls` to examine the contents of files and folders
* `touch` to make new files from scratch
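A quick session putting these together (the file name is hypothetical; your output will differ):

```bash
date                  # current date and time
whoami                # your user name
echo "Hello World"    # print to the console
pwd                   # present working directory
ls                    # list files and folders here
touch notes.txt       # create an empty file called notes.txt
ls -l notes.txt       # confirm it now exists
```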
Globbing - how to select files in a general way:

* `*` wildcard for any number of characters
* `?` wildcard for a single character
* `[]` for one of many character options
* `!` for exclusion
* character classes: `[:alpha:]`, `[:alnum:]`, `[:digit:]`, `[:lower:]`, `[:upper:]`
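A sketch of these patterns, assuming a directory that contains the hypothetical files `data1.csv`, `data2.csv`, `dataA.csv`, and `notes.txt`:

```bash
ls *.csv                 # any number of characters: all three .csv files
ls data?.csv             # exactly one character: data1.csv data2.csv dataA.csv
ls data[12].csv          # one of the listed characters: data1.csv data2.csv
ls data[!12].csv         # exclusion: dataA.csv
ls data[[:digit:]].csv   # character class: data1.csv data2.csv
```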
Knowing where your terminal is executing code ensures you are working with the right inputs and making the right outputs.
Use the command `pwd` to determine the Present Working Directory.

Let’s say you need to change to a folder called “git-repo”. To change directories you can use a command like `cd git-repo`.

* `.` refers to the current directory, such as `./git-repo`
* `..` can be used to move up one folder, use `cd ..`, and can be combined to move up multiple levels: `../../my_folder`
* `/` is the root of the Linux OS, where there are core folders, such as system, users, etc.
* `~` is the home directory. Move to folders referenced relative to this path by including it at the start of your path, for example `~/projects`.

To view the structure of directories from your present working directory, use the `tree` command.
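Putting it together, assuming you start in your home directory and it contains a `git-repo` folder:

```bash
pwd              # e.g. /home/you
cd git-repo      # move down into git-repo (relative path)
cd ..            # move back up one level
cd ~/git-repo    # same destination via a path relative to home
cd /             # jump to the root of the filesystem
cd ~             # and back home
tree -L 2        # show the directory structure two levels deep (if installed)
```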
Now that we know how to navigate through directories, we need to learn the commands for interacting with files:

* `mv` to move files from one location to another
* `cp` to copy files instead of moving
* `mkdir` to make a directory
* `rm` to remove files
* `rmdir` to remove directories
* `rm -rf` to blast everything! WARNING!!! DO NOT USE UNLESS YOU KNOW WHAT YOU ARE DOING
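A safe, self-contained sequence to practice these on throwaway names (all names here are hypothetical):

```bash
mkdir scratch                    # make a directory
touch scratch/a.txt              # create an empty file inside it
cp scratch/a.txt scratch/b.txt   # copy the file
mv scratch/b.txt scratch/c.txt   # rename (move) the copy
rm scratch/a.txt scratch/c.txt   # remove both files
rmdir scratch                    # remove the now-empty directory
```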
Commands:

* `head FILENAME` / `tail FILENAME` - glimpse the first / last few rows of data
* `more FILENAME` / `less FILENAME` - view the data with basic up / (up & down) controls
* `cat FILENAME` - print the entire file contents into the terminal
* `vim FILENAME` - open (or edit!) the file in the vim editor
* `grep PATTERN FILENAME` - search for lines within a file that match a regular expression
* `wc FILENAME` - count the number of lines (`-l` flag) or number of words (`-w` flag)

* `|` sends the stdout to another command (it is the most powerful symbol in bash!)
* `>` sends stdout to a file and overwrites anything that was there before
* `>>` appends the stdout to the end of a file (or starts a new file from scratch if one does not exist yet)
* `<` sends stdin from a file into the command on its left
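All four operators in one short session (the file name is illustrative):

```bash
echo "Hello World" > greeting.txt    # > creates/overwrites greeting.txt
echo "Hello again" >> greeting.txt   # >> appends a second line
wc -l < greeting.txt                 # < feeds the file to wc's stdin: prints 2
cat greeting.txt | grep again        # | pipes cat's stdout into grep
```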
To-dos:

* Try `echo Hello World`
`.bashrc` is where your shell settings are located.
If we wanted a shortcut to find out the number of our running processes, we would write a command like `whoami | xargs ps -u | wc -l`.
We don’t want to write out this full command every time! Let’s make an alias.
alias alias_name="command_to_run"
alias nproc="whoami | xargs ps -u | wc -l"
Now we need to put this alias into `.bashrc`:
alias nproc="whoami | xargs ps -u | wc -l" >> ~/.bashrc
What happened??
echo 'alias nproc="whoami | xargs ps -u | wc -l"' >> ~/.bashrc
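The single quotes are the fix: they stop the current shell from interpreting the inner double quotes and pipes, so the alias line lands in `~/.bashrc` exactly as written. A sketch of the full round trip:

```bash
# Append the alias definition verbatim to ~/.bashrc
echo 'alias nproc="whoami | xargs ps -u | wc -l"' >> ~/.bashrc

# Reload your settings in the current session, then try it
source ~/.bashrc
nproc   # process count for your user (plus one line for the ps header)
```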
Your commands get saved in ~/.bash_history
Use the command `ps` to see your running processes.

Use the command `top` (or, even better, `htop`) to see all the running processes on the machine.

Install the program htop using the command `sudo yum install htop -y`.
Find the process ID (PID) so you can kill a broken process.
Use the command `kill [PID NUM]` to signal the process to terminate. If things get really bad, then use the command `kill -9 [PID NUM]`.

To kill a command in the terminal window it is running in, try using Ctrl + C or Ctrl + \.
Run the `cat` command on its own to let it stay open. Now open a new terminal to examine the processes and find the cat process.
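A sketch of the exercise (the PID shown is a placeholder; use the one you actually see):

```bash
# Terminal 1: cat with no arguments just waits on stdin
cat

# Terminal 2: find and terminate it
ps -u "$(whoami)" | grep cat   # the PID is in the first column
kill 12345                     # polite termination request (SIGTERM)
kill -9 12345                  # last resort if it ignores the first signal
```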
Reference material: Text Lessons 1, 2, 3, 7, 9, 10
https://gitlab.com/slackermedia/bashcrawl is a game to help you practice your navigation and file access skills. Click on the binder link in this repo to launch a jupyter lab session and explore!