Course overview, introduction to big data and the Cloud
Georgetown University
Fall 2023
bash
These are also pinned on the main Slack channel
Fun Facts
Where does it come from?
How is it being created?
We can record every:
More from the MotherDuck Blog
“In essence, Big Data is a term for a collection of datasets so large and complex that it becomes difficult to process using traditional tools and applications. Big Data technologies describe a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery and/or analysis”
“Big data is when the size of the data itself becomes part of the problem”
“Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.”
Can you analyze/process your data on a single machine?
Can you store (or is it stored) on a single machine?
If any of the answers is no, then you have a big-ish data problem!
The definition of “big data” is relative!
We’ll talk briefly about Apache Hadoop today but we will not cover it in this course.
Other:
Matt Turck’s Machine Learning, Artificial Intelligence & Data Landscape (MAD)
One big box, all processors share memory
This was:
It was all premium hardware. And yet it still was not big enough!
Not expensive, premium nor fancy in any way
Desktop-like servers are cheap, so buy a lot!
Needed more complex software to be able to run on lots of smaller/cheaper machines.
Hang on to this thought!
They needed to build some kind of distributed storage layer to be the foundation of a scalable system. They came up with these requirements:
Does this sound familiar?
Describes how Google stored its information at scale: a reliable, highly available storage system can be built on commodity machines, provided the design assumes that failures are the norm rather than the exception.
GFS is:
Describes how Google processes data, at scale using MapReduce, a paradigm based on functional programming. MapReduce is an approach and infrastructure for doing things at scale. MapReduce is two things:
Which provides the following benefits:
The Current Hadoop Ecosystem: https://hadoopecosystemtable.github.io/
username@hostname:current_directory $
What do we learn from the prompt?
COMMAND -F --FLAG
COMMAND -F --FILE file1
Here we pass a text argument “file1” to the FILE flag
The -h flag is usually used to get help. You can also run the man command and pass the name of the program as the argument to get its help page.
Let’s try basic commands:
date to get the current date
whoami to get your user name
echo "Hello World" to print to the console
Find out your Present Working Directory using pwd
Examine the contents of files and folders using the ls command
Make new files from scratch using the touch command
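A short session that strings these basic commands together might look like the sketch below. The temporary directory and the file names notes.txt and data.csv are made up for illustration, so nothing in your real folders is touched.

```bash
#!/usr/bin/env bash
# Work in a throwaway directory (hypothetical scratch space for this demo).
workdir=$(mktemp -d)
cd "$workdir"

date                       # current date and time
whoami                     # your user name
echo "Hello World"         # print text to the console
pwd                        # present working directory (the temp directory)

touch notes.txt data.csv   # create two empty files
ls                         # list the directory contents
ls -l                      # long listing: permissions, size, modification time
```

Running touch on a file that already exists does not erase it; it only updates the file's modification time.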
Globbing - how to select files in a general way
* for a wildcard matching any number of characters
? for a wildcard matching a single character
[] for one of many character options
! for exclusion
[:alpha:], [:alnum:], [:digit:], [:lower:], [:upper:] for character classes
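To see how each pattern behaves, here is a minimal sketch run against a handful of made-up file names in a temporary directory. Using echo instead of ls shows exactly what the shell expands each pattern to before any command runs.

```bash
#!/usr/bin/env bash
# Hypothetical files created only for this demo.
workdir=$(mktemp -d)
cd "$workdir"
touch a1.txt a2.txt b1.txt b22.txt notes.md

echo *.txt                         # a1.txt a2.txt b1.txt b22.txt
echo ?1.txt                        # a1.txt b1.txt
echo [ab]2.txt                     # a2.txt
echo [!a]*.txt                     # b1.txt b22.txt
echo b[[:digit:]][[:digit:]].txt   # b22.txt
```

Note that the exclusion character and the character classes only have their special meaning inside brackets, which is why the class appears as [[:digit:]] with doubled brackets.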
Knowing where your terminal is executing code ensures you are working with the right inputs and making the right outputs.
Use the command pwd to determine the Present Working Directory.
Let’s say you need to change to a folder called “git-repo”. To change directories you can use a command like cd git-repo.
. refers to the current directory, such as ./git-repo
.. can be used to move up one folder, use cd .., and can be combined to move up multiple levels, such as ../../my_folder
/ is the root of the Linux OS, where there are core folders, such as system, users, etc.
~ is the home directory. Move to folders referenced relative to this path by including it at the start of your path, for example ~/projects.
To view the structure of directories from your present working directory, use the tree command
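The path shortcuts above can be tried out safely in a scratch directory. This is only a sketch: the folder layout git-repo/src/utils is invented for the demo, and tree is run only if it happens to be installed.

```bash
#!/usr/bin/env bash
# Hypothetical folder layout created only for this demo.
workdir=$(mktemp -d)
mkdir -p "$workdir/git-repo/src/utils"

cd "$workdir/git-repo"     # change into a folder by its path
pwd                        # confirm where we are

cd ./src/utils             # . means "start from the current directory"
cd ..                      # up one level, now in src
after_one_up=$(basename "$PWD")

cd utils
cd ../..                   # up two levels, now in git-repo
after_two_up=$(basename "$PWD")

cd /                       # the root of the filesystem
ls                         # core system folders live here
echo ~                     # ~ expands to your home directory
echo ~/projects            # a path written relative to home

cd "$workdir"
command -v tree > /dev/null && tree   # runs only if tree is installed
```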
Now that we know how to navigate through directories, we need to learn the commands for interacting with files
mv to move files from one location to another
cp to copy files instead of moving
mkdir to make a directory
rm to remove files
rmdir to remove directories
rm -rf to blast everything! WARNING!!! DO NOT USE UNLESS YOU KNOW WHAT YOU ARE DOING
Commands:
head FILENAME / tail FILENAME - glimpsing the first / last few rows of data
more FILENAME / less FILENAME - viewing the data one screen at a time with basic / (up & down) controls
cat FILENAME - print entire file contents into terminal
vim FILENAME - open (or edit!) the file in vim editor
grep PATTERN FILENAME - search for lines within a file that match a regular expression
wc FILENAME - count the number of lines (-l flag) or number of words (-w flag)
| sends the stdout to another command (is the most powerful symbol in BASH!)
> sends stdout to a file and overwrites anything that was there before
>> appends the stdout to the end of a file (or starts a new file from scratch if one does not exist yet)
< sends the contents of a file as stdin into the command on the left
To-dos:
echo Hello World
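As a starting point for practice, the sketch below combines the file commands, the viewing commands, and the redirection symbols in one scratch session. The file numbers.txt and the folders archive and empty_dir are made up for the demo.

```bash
#!/usr/bin/env bash
# Hypothetical scratch directory and file names, used only for this demo.
workdir=$(mktemp -d)
cd "$workdir"

seq 1 20 > numbers.txt        # > creates the file (or overwrites it)
echo "21" >> numbers.txt      # >> appends a line to the end

head -n 3 numbers.txt         # first three lines: 1, 2, 3
tail -n 2 numbers.txt         # last two lines: 20, 21
line_count=$(wc -l < numbers.txt)   # < feeds the file to wc as stdin
echo "$line_count"                  # 21

# Pipe: grep keeps the lines containing the digit 1, wc -l counts them.
matches=$(grep "1" numbers.txt | wc -l)
echo "$matches"                     # 12 (1, 10 through 19, and 21)

mkdir archive empty_dir
cp numbers.txt archive/backup.txt   # copy, original stays in place
mv numbers.txt archive/numbers.txt  # move, original is gone
rm archive/backup.txt               # remove a single file
rmdir empty_dir                     # rmdir only works on empty directories
ls archive                          # numbers.txt
```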
.bashrc is where your shell settings are located
If we wanted a shortcut to find out the number of our running processes, we would write a command like whoami | xargs ps -u | wc -l.
We don’t want to write out this full command every time! Let’s make an alias.
alias alias_name="command_to_run"
alias nproc="whoami | xargs ps -u | wc -l"
Now we need to put this alias into the .bashrc
alias nproc="whoami | xargs ps -u | wc -l" >> ~/.bashrc
What happened??
echo 'alias nproc="whoami | xargs ps -u | wc -l"' >> ~/.bashrc
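The difference between the two attempts can be checked without touching your real settings. In this sketch a temporary file (the hypothetical rcfile) stands in for ~/.bashrc. The alias command prints nothing when it defines an alias, so redirecting it appends nothing, while echo prints the text so it can be appended. Wrapping the whole line in single quotes keeps the inner double quotes intact in the file.

```bash
#!/usr/bin/env bash
rcfile=$(mktemp)   # stand-in for ~/.bashrc, used only for this demo

# Attempt 1: defines the alias in the current shell but writes no output,
# so nothing is appended to the file.
alias nproc="whoami | xargs ps -u | wc -l" >> "$rcfile"
size_after_first=$(wc -c < "$rcfile")
echo "$size_after_first"      # 0

# Attempt 2: echo prints the line, and >> appends it to the file.
echo 'alias nproc="whoami | xargs ps -u | wc -l"' >> "$rcfile"
cat "$rcfile"
```

With the real ~/.bashrc, the alias becomes available in new terminals, or immediately after running source ~/.bashrc. One caveat: nproc is also the name of a standard coreutils command that reports the number of processors, so the alias shadows it; a different alias name avoids the clash.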
Your commands get saved in ~/.bash_history
Use the command ps to see your running processes.
Use the command top or even better htop to see all the running processes on the machine.
Install the program htop using the command sudo yum install htop -y
Find the process ID (PID) so you can kill a broken process.
Use the command kill [PID NUM] to signal the process to terminate. If things get really bad, then use the command kill -9 [PID NUM]
To kill a command in the terminal window it is running in, try using Ctrl + C or Ctrl + \
Run the cat command on its own to let it stay open. Now open a new terminal to examine the processes and find the cat process.
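The same find-and-kill workflow can be rehearsed in a single terminal. This sketch assumes a background sleep as the "broken" process in place of cat, since a background cat would stop as soon as it tried to read from the terminal.

```bash
#!/usr/bin/env bash
sleep 300 &                # start a long-running process in the background
pid=$!                     # $! holds the PID of the most recent background job

ps -p "$pid"               # show just that process
ps -u "$(whoami)"          # list all of your processes

kill "$pid"                # polite request to terminate (SIGTERM)
wait "$pid" 2> /dev/null   # let the shell collect the finished process
# kill -9 "$pid" would be the forceful fallback (SIGKILL) if it refused to exit.

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2> /dev/null; then
    status="running"
else
    status="terminated"
fi
echo "$status"             # terminated
```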
Bash crawl is a game to help you practice your navigation and file access skills. Click on the binder link in this repo to launch a jupyter lab session and explore!
DSAN 6000 | Fall 2023 | https://gu-dsan.github.io/6000-fall-2023/