Lecture 2

Introduction to the cloud - extended lab

Amit Arora, Abhijit Dasgupta, Anderson Monken, and Marck Vaisman

Georgetown University

Fall 2023

Look back and ahead

  • Great use of Slack
  • Big data definitions
  • Used the Shell in Linux on a virtual machine through Codespaces

Deadlines

  • Lab 1: Linux Basics Due Aug 29 6pm
  • Quiz 1: Background Skills Due Sept 2 12pm
  • Assignment 1: Python Skills Due Sept 5 11:59pm
  • Lab 2: Cloud Tooling Due Sept 5 6pm
  • Assignment 2: Shell & Linux Due Sept 11 11:59pm

Agenda and Goals for Today

Lecture

  • Background of the cloud and related concepts
  • Tour of the cloud services that are used in the course

Lab

  • AWS SageMaker Setup
    • AWSAcademy Learning Management System (LMS)
    • AWS Console Overview
    • SageMaker Setup
  • Azure Machine Learning Setup
    • Create resource group
    • Deploy Azure Machine Learning
    • Create Azure Machine Learning compute instance
  • Work with VSCode on Azure Machine Learning
    • Browser-based VSCode
    • Locally-based VSCode (if time permits)
  • Lab deliverable on GitHub Classroom

Are you ready for the Cloud?

Working on a single machine

You are most likely using traditional data analysis tools, which are single-threaded and run on a single machine.

The BIG DATA problem

Is Moore’s Law Dead?

New Hardware

Need

  • The demand for data processing will not be met by relying on the same technology.
  • The key to modern data processing is new semiconductors
    • Not just squeezing more transistors per area
    • Need new compute architectures that are built and optimized for specialized functions
  • Specialized edge hardware for Edge Computing
  • Many declare Moore’s Law to be broken or no longer valid, but the law itself is not what is broken: the real limit is a heat problem.

What

  • Graphics Processing Units (GPUs)

  • Field Programmable Gate Arrays (FPGAs)

  • Data Processing Units (DPUs)

  • Photonic computing

So, we can’t store or process data on a single machine, what do we do?

We distribute

More CPUs, more memory, more storage!
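
The idea of distributing work can be sketched on a small scale: split the data into chunks, process each chunk independently, then combine the partial results. The sketch below is only an illustration under assumptions, not part of the course material; the function names are hypothetical, and it uses Python's standard `concurrent.futures` thread pool on one machine as a stand-in for the many machines a cloud cluster would provide.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks nearly equal contiguous pieces."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work done independently on each chunk."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(data, n_workers=4, executor_cls=ThreadPoolExecutor):
    """Split the data, process the chunks concurrently, then combine the partial results."""
    chunks = split_into_chunks(data, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


print(distributed_sum_of_squares(list(range(10)), n_workers=3))
```

Threads keep the sketch simple. Passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` (from a script guarded by `if __name__ == "__main__":`) would run the chunks on separate CPU cores, and distributed frameworks apply the same split, process, and combine pattern across machines.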

How do we do that?

Simple, we use the cloud

Cloud computing is a big deal!

Benefits

  • Provides access to low-cost computing

  • Costs are decreasing every year

  • Elastic

  • PaaS works!

  • Many other benefits…

What is the claaaaaaawd (the cloud)

What is the cloud?

\kloud\ noun

the practice of storing regularly used computer data on multiple servers that can be accessed through the Internet

Using someone else’s computer(s)

NIST Definition

Cloud Services Models

Cloud Services Models Analogy

The evolution of the Cloud

| Yesterday | Today | Tomorrow |
|---|---|---|
| Limited number of tools and vendors | Many tools and vendors to work with | Integrated tools and vendors |
| One platform - few devices | Multiple platforms - many devices | Connected platforms and devices |
| Data is scarce but manageable | Overabundance of data | Data is used for important business decisions |
| IT has major influence and control | IT has limited influence and control | IT is strategic to the business |
| People only work when they are at work | People work wherever they want | People have access to what they need, wherever they are |

What does the cloud look like?

Virtual Visit to a Microsoft Azure Data Center

Microsoft Azure Data Center in Boydton, VA

Loudoun County, VA is called “CLoudoun”

  • How data centers power VA’s Loudoun County: https://gcn.com/articles/2018/10/12/loudoun-county-data-centers.aspx

  • The heart of “The Cloud” is in Virginia: https://www.cbsnews.com/news/cloud-computing-loudoun-county-virginia/

  • CBS Sunday Morning Visits the Home of the Internet in Loudoun County: https://biz.loudoun.gov/2017/10/30/cbs-sunday-morning-visits-loudoun/

NoVa Data Center Map

70% of the world’s internet traffic passes through Loudoun County, VA

Overview of services used in this class

AWS

Azure

Services we will be using

| Category | AWS Service | Azure Service | Description |
|---|---|---|---|
| Compute (virtual machines) | Elastic Compute Cloud (EC2) Instances | Virtual Machines | Virtual servers allow users to deploy, manage, and maintain OS and server software. Instance types provide combinations of CPU/RAM. Users pay for what they use with the flexibility to change sizes. |
| Storage | Simple Storage Service (S3) | Blob storage | Object storage service, for use cases including cloud applications, content distribution, backup, archiving, disaster recovery, and big data analytics. |
| Networking | Virtual Private Cloud (VPC) | Virtual Network | Cloud virtual networking |
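
The "pay for what they use" point in the compute description can be made concrete with a little arithmetic. The sketch below compares a machine left running all month with one that runs only while you work. The hourly rate and usage hours are hypothetical numbers chosen for illustration, not actual AWS or Azure prices; amounts are kept in whole cents to avoid floating-point rounding.

```python
def compute_cost_cents(rate_cents_per_hour, hours):
    """Cost of running one instance for the given number of hours, in cents."""
    return rate_cents_per_hour * hours


RATE_CENTS_PER_HOUR = 10   # hypothetical price, not a real quote
HOURS_IN_MONTH = 30 * 24   # instance left running all month
HOURS_ACTUALLY_USED = 40   # instance stopped whenever you are not working

always_on = compute_cost_cents(RATE_CENTS_PER_HOUR, HOURS_IN_MONTH)
pay_per_use = compute_cost_cents(RATE_CENTS_PER_HOUR, HOURS_ACTUALLY_USED)

print(f"Always on:   ${always_on / 100:.2f}")
print(f"Pay per use: ${pay_per_use / 100:.2f}")
```

Under these assumed numbers the always-on machine costs 18 times as much, which is why stopping instances you are not using matters.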

Services we will use

| Category | AWS Service | Azure Service | Description |
|---|---|---|---|
| Machine Learning Platforms | Amazon SageMaker | Azure Machine Learning | Machine learning platforms that abstract away details of networking, compute, etc. so you can focus on the analytics, big data, and modeling. |
| IoT Streaming | Kinesis Firehose, Kinesis Streams | Event Hubs | Services that allow the mass ingestion of small data inputs, typically from devices and sensors, to process and route the data. |

Services we may use

| Category | AWS Service | Azure Service | Description |
|---|---|---|---|
| Relational database | RDS | SQL Database, Database for MySQL, Database for PostgreSQL | Managed relational database service where resiliency, scale, and maintenance are primarily handled by the platform. |
| NoSQL / Document DB | DynamoDB, SimpleDB, Amazon DocumentDB | Cosmos DB | A globally distributed, multi-model database that natively supports multiple data models: key-value, documents, graphs, and columnar. |
| Big Data Processing | EMR | Databricks | Apache Spark-based analytics platform. |
| Big Data Processing | EMR | HDInsight | Managed Hadoop service. Deploy and manage Hadoop clusters in Azure. |

Mapping of AWS <–> Azure services

https://aka.ms/awsazureguide

Interfacing with cloud services

  • portal / graphical interface
  • command line
  • REST APIs
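
As a rough illustration of the REST API option, the sketch below builds (but does not send) an HTTP request that would list the resource groups in an Azure subscription through the Azure Resource Manager endpoint. The subscription ID and token are placeholders, the `api-version` value is an assumption that should be checked against the current Azure documentation, and the function name is hypothetical. The portal and the command line tools ultimately make calls of this kind on your behalf.

```python
import urllib.request

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2021-04-01"  # assumed version; verify against the Azure REST reference


def build_list_resource_groups_request(subscription_id, access_token):
    """Build a GET request for a subscription's resource groups (the request is not sent)."""
    url = (
        f"{ARM_ENDPOINT}/subscriptions/{subscription_id}"
        f"/resourcegroups?api-version={API_VERSION}"
    )
    headers = {"Authorization": f"Bearer {access_token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


# Placeholder values only; a real call needs your own subscription ID and a valid token.
request = build_list_resource_groups_request(
    "00000000-0000-0000-0000-000000000000", "PLACEHOLDER_TOKEN"
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would require valid credentials, so the sketch stops at constructing it.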

Lab Time!

Set Up GitHub Personal Access Token

Link to GitHub setup

Glossary

| Term | Definition |
|---|---|
| Local | Your current workstation (laptop, desktop, etc.), wherever you start the terminal/console application. |
| Remote | Any machine you connect to via ssh or other means. |
| EC2 | Single virtual machine in the AWS cloud where you can run computation (ephemeral) |
| SageMaker | Integrated development environment in AWS where you can conduct data science on single machines |
| Azure Machine Learning | Azure's managed machine learning platform that also provides interactive computing |
| VSCode | Visual Studio Code |
| Compute Instance | A virtual machine running inside Azure or AWS |
| Ephemeral | Lasting for a short time: any machine that will be turned off, or any place where you will lose data |
| Persistent | Lasting for a long time: any environment where your work is NOT lost when the timer goes off |