Syllabus

Modified

Wednesday, August 27, 2025

Instructor

Course details

  •   Wednesdays (Prof Amit), Mondays & Thursdays (Prof Jeff)
  •   August 27 – December 20, 2025
  •   9:30 AM–12:00 PM (Amit), 3:30-6:00 PM (Jeff Mon), 6:30-9:00 PM (Jeff Thu)
  •   Reiss 262 (W, Amit), Walsh 394 (Mon & Thu, Jeff)
  •   Slack (Join)

This syllabus was last updated on Wednesday, August 27, 2025

Instructors

TAs

Communication

Important

The primary mode of communication will be Slack (Join link). Workspace name: DSAN6000 Fall 2025

Instructional team e-mail: dsan-Fall-2025@georgetown.edu
This is the preferred way of contacting the professors privately

This website will be kept up to date with information about the course


Course Description

Data is everywhere! Many times, the data is just too big to analyze with traditional programming libraries on your laptop. That is where cloud providers and distributed computation save the day. This is a hands-on, practical, workshop-style course about using cloud computing resources to analyze and manipulate datasets that are too large to fit on a single machine and/or to be analyzed with traditional tools. The course will focus on techniques (such as concurrency) and tools (such as Apache Spark) for working with big data.

You will understand how to ingest the data, and then massage, clean, transform, analyze, and model it within the context of big data analytics. You will be able to think more programmatically and logically about your big data needs, tools, and issues.

Learning Objectives

  • Set up, operate, and manage big data tools and cloud infrastructure, including Spark, MapReduce, and Amazon Web Services
  • Use ancillary tools that support big data processing, including git and the Linux command line
  • Execute a big data analytics exercise from start to finish: ingest, wrangle, clean, analyze, store, and present
  • Develop strategies to break down large problems and datasets into manageable pieces
  • Identify broad spectrum resources and documentation to remain current with big data tools and developments
  • Communicate and interpret the big data analytics results through written and verbal methods

Pre-requisites

  • Experience with Python and SQL. Note: We will use Python as the primary interface to Apache Spark, through PySpark
  • Experience with git and GitHub

Some tutorials to brush up on these skills:

Books, Software and Cloud Resources

Readings (for assigned readings)

There is no required textbook for the course. We have selected specific chapters from several sources as well as several seminal papers in the big data space, and these will be provided to you in PDF format. We may also provide supplemental materials (articles, links, videos, etc.) to complement the readings. You must read assigned readings prior to the lectures.

Cloud Resources

You will use cloud resources on Amazon Web Services. We will discuss how to set up your accounts and environments in class and lab within the first couple of weeks. You will get credits on both platforms that will be enough to support your coursework throughout the semester.

Modules

Week | Module | Details
1 | Course overview. Introduction to big data concepts. | Overview of course objectives and an introduction to big data and Linux shell basics.
2 | Introduction to the cloud. | Introduction to cloud computing concepts and platforms like AWS and Azure.
3 | Parallelization (multiprocessing, asyncio) | Exploring parallel computing with Python using multiprocessing and asyncio for concurrency.
4 | DuckDB, Polars, and file formats | Working with DuckDB, Polars, and the Parquet file format for efficient data processing.
5 | Data Warehouse | Understanding data warehousing with Presto, Snowflake, and Athena; hands-on lab with Presto/Athena.
6 | Introduction to Spark, RDDs, and DataFrames | Introduction to Apache Spark, focusing on RDDs and DataFrames for distributed data processing.
7 | Spark DataFrames and Spark SQL | Advanced use of Spark DataFrames and Spark SQL for structured data analysis.
8 | Spark ML | Utilizing Spark’s MLlib for scalable machine learning tasks.
9 | Spark NLP | Applying Spark for Natural Language Processing (NLP) tasks on large datasets.
10 | Spark Streaming | Real-time data processing with Spark Streaming for handling streaming data.
11 | Accelerate Python workloads with Ray, RAPIDS | Leveraging Ray and RAPIDS to accelerate Python-based data processing with distributed computing and GPUs.
12 | Vector databases | Introduction to vector databases and their use in managing large-scale AI-driven applications.
13 | Data engineering with serverless (Lambda) | Building and deploying data engineering solutions using AWS Lambda and other serverless technologies.
- | Thanksgiving break | No class due to Thanksgiving break.
14 | Project, open-discussion | Final project presentations and open discussion on course topics.

Warning

IT IS YOUR RESPONSIBILITY TO MANAGE THE CREDITS AND RESOURCES PROVIDED TO YOU. YOU MUST SHUT DOWN YOUR CLOUD RESOURCES WHEN NOT IN USE.

Learning Activities, Communication and Evaluation

This is a hands-on, practical, workshop-style course that provides opportunities to use the tools and techniques discussed in class. Although this is not a programming course per se, there is programming involved.

Lectures and Labs

This course is split into a lecture/lab format, where every class session will have a lecture portion, and most sessions will have an in-class lab portion:

  • During the lecture, we will discuss the concepts and techniques as well as the history and development of these big data tools and cloud platforms.
  • During the lab sessions, you will complete exercises and follow examples designed to show you how to implement the ideas and concepts with various tools. We will start the labs in class but will not finish them. It is your responsibility to complete the labs, which are part of your grade and will enable your learning.

Lectures may not cover all the material and some topics will be introduced in the lab or through readings/assignments.

Office Hours

Instructors and TAs will hold recurring office hours to answer questions, review material, and support your learning. The times, dates, and location of office hours will be announced via Canvas and Slack in advance.

Readings

On certain weeks, readings will be assigned to prepare you for the lecture material being presented. These readings should take an hour or less per week.

Online Quizzes

Quizzes will be given a few times during the semester during lab or lecture at random intervals and times. Quizzes ensure you are keeping up with the material presented in the class. Quizzes are meant to be brief and low-stress with a time limit of 5-10 minutes. The material will be drawn from lectures, labs, and readings.

Lab Deliverables

Each lab will have a deliverable. It is essential that you learn the skills presented in the labs so that you can effectively complete the assignments and the big data project. The lab deliverables can sometimes be completed during lab. However, it is your responsibility to complete the deliverable as part of your work outside of lecture/lab time.

Homework Assignments

You will be given several homework assignments over roughly half of the semester. The goal of these problem sets is to hone your big data skills by answering questions about large datasets. The problem sets will build on the labs and will be much more in-depth. Deliverables from each assignment will usually include the code written for your programs and the output produced.

Warning

Please start assignments as soon as they are posted. These assignments can take several hours to complete depending on your familiarity with the material. You will not complete the assignments on time if you start the day they are due.

Note

We reuse problem set questions, so we expect students not to copy, refer to, or look at the solutions in preparing their answers. Since this is a graduate-level class, we expect students to want to learn and not search online for answers. See the Academic Integrity section.

Big Data Analytics Project

Students will assemble into groups of no more than 4 students. You will perform and write up an analysis of a big dataset using the tools learned in class. Big is defined as “a dataset that is so large that you cannot work with it on a laptop.”

You will each conduct a big data project using Reddit textual data. This project will encompass all of the skills you will learn throughout the semester, including data ingestion, transformation, analysis, natural language processing, and machine learning. Similar to previous core courses, the big data project will be a product you can share with the world to demonstrate your data science expertise. Intermediate project assignments will help you incrementally build towards your final portfolio. There will be assignments on exploratory data analysis, machine learning, and natural language processing. Each successive assignment and the final submissions will incorporate feedback from your peers, TAs, and professors. The final timelines and deliverables for the project will be announced in class.

Grading and Evaluation

Grade Weights

  • Homework: 30%
  • Lab: 10%
  • Case Study: 15%
  • Discussion: 15%
  • Individual Project Portfolio: 30%

Total is 100%.
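
As a rough illustration of how these weights combine (a sketch, not an official grade calculator), the Python snippet below computes a weighted course percentage. The category weights are copied from the list above; the function name `final_percentage` and the example scores are hypothetical and only there to show the arithmetic.

```python
# Weights copied from the "Grade Weights" list; they sum to 100.
WEIGHTS = {
    "Homework": 30,
    "Lab": 10,
    "Case Study": 15,
    "Discussion": 15,
    "Individual Project Portfolio": 30,
}


def final_percentage(scores):
    """Return the weighted course percentage from per-category scores (0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "Homework": 90,
    "Lab": 100,
    "Case Study": 80,
    "Discussion": 85,
    "Individual Project Portfolio": 95,
}
print(final_percentage(example))  # 90.25
```

With these made-up scores the result is 90.25%, which would fall in the A- band of the grading scale.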

Grading Scale

We have no plans to curve the final grade, and the final letter grade will be:

  • A: ≥ 92.5%
  • A-: 89.5 - 92.49%
  • B+: 87.99 - 89.49%
  • B: 81.5 - 87.98%
  • B-: 79.5 - 81.49%
  • C: 70 - 79.49%
  • D: 60 - 70%
  • F: 60% or less

Submitting your work

GitHub classroom

We use GitHub Classroom for all class deliverables: assignments, labs, and the final project. Submitting your work is the process of committing your files and results to your local repository and then pushing them to GitHub.

Important

You must submit everything through GitHub!

Use the final-submission commit message

When you are ready for your work to be evaluated, you MUST use the commit message final-submission. If you do not use the commit message final-submission we will assume that you are still working in the repository and we will only grade what is present. By submitting that commit message, you are stating that you are finished with the assignment and are ready for feedback.

Important

Make sure you understand the difference between a git commit and a push, and that you push your repository successfully to GitHub.

If you need to make a correction after your final-submission and the submission deadline has not yet passed, you can amend your previous commit. See amending a commit for instructions. Do not change the commit message; it should continue to say “final-submission” after the amend.
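
The commit, push, and amend steps described in this section can be sketched as the following sequences of git commands, written here as Python lists so they can be run with `subprocess`. This is a minimal sketch under assumptions, not the official procedure: the helper names (`submit_commands`, `amend_commands`, `run_all`) are hypothetical, `git add --all` assumes you want to stage every changed file, and the `--force-with-lease` push assumes the original final-submission commit was already pushed. The course's own amending instructions take precedence.

```python
import subprocess

FINAL_MESSAGE = "final-submission"  # exact commit message required above


def submit_commands():
    """Commands for a first submission: stage, commit, push."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", FINAL_MESSAGE],
        ["git", "push"],
    ]


def amend_commands():
    """Commands for a correction before the deadline, keeping the same message."""
    return [
        ["git", "add", "--all"],
        # --no-edit reuses the existing commit message ("final-submission").
        ["git", "commit", "--amend", "--no-edit"],
        # Assumption: the amended commit replaces one that was already pushed,
        # so the push must overwrite it. --force-with-lease refuses to do so
        # if the remote branch changed unexpectedly.
        ["git", "push", "--force-with-lease"],
    ]


def run_all(commands, repo_dir):
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)
```

For example, `run_all(submit_commands(), "path/to/your/repo")` would stage, commit, and push from that (placeholder) directory. A commit that is never pushed does not reach GitHub, which is why the push step appears in both sequences.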

Warning

No further edits to your GitHub repository are allowed after using the final-submission commit message.

Important

We will use commit datetime and commit message to assess lateness.

Late policy

In lieu of extensions, there is a tiered deduction scale if a deliverable is late. Late penalties only apply to labs and assignments.

We will assess exceptional circumstances on a case-by-case basis, and only if we are made aware before a deliverable’s deadline, not after.

  • A late penalty of 10% per day, up to 4 days, will be assessed for assignments and labs that are submitted with a final-submission commit message after the deadline. You may still submit a missed lab or assignment up until the last day of class (Dec 20th, 2025) with a maximum possible grade of 60%.
  • Missed in-class quizzes cannot be made up and will receive a grade of zero.
  • Project deadlines are fixed and have no extensions or late penalty. A missed project deliverable will receive a grade of zero.
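
To make the tiered scale for labs and assignments concrete, here is a small Python sketch of the maximum possible grade as a function of when the final-submission commit was made. Several details are assumptions, not stated policy: a partial day late is counted as a full day, both datetimes are in the same time zone, the last day of class ends at 23:59:59 on Dec 20, 2025, and a submission after that date is treated as worth zero. The function name `max_possible_grade` is hypothetical.

```python
import math
from datetime import datetime

# Assumption: submissions are accepted through the end of the last day of class.
LAST_DAY_OF_CLASS = datetime(2025, 12, 20, 23, 59, 59)


def max_possible_grade(deadline, committed):
    """Return the highest grade (in %) a lab or assignment can still earn."""
    if committed <= deadline:
        return 100
    if committed > LAST_DAY_OF_CLASS:
        return 0  # assumption: nothing is accepted after the last day of class
    # Assumption: any fraction of a day late counts as a whole day.
    days_late = math.ceil((committed - deadline).total_seconds() / 86400)
    # 10% per day, capped at 4 days, which gives the 60% maximum.
    return 100 - 10 * min(days_late, 4)


deadline = datetime(2025, 10, 1, 23, 59)  # hypothetical deadline
print(max_possible_grade(deadline, datetime(2025, 10, 2, 8, 0)))    # 90
print(max_possible_grade(deadline, datetime(2025, 10, 20, 12, 0)))  # 60
```

This only models labs and assignments; quizzes and project deliverables follow the zero-grade rules in the list above.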

Other course policies

Attendance and punctuality

Attendance is mandatory and will be taken. Given the technical nature of this course, and the breadth of topics discussed, you are expected to attend each class, to complete all readings, and to participate actively in lectures, discussions and exercises. We understand there may be times when you need to miss class. Please inform us in advance if you are not able to attend class for any reason. However, it is up to you to keep up.

Participation

We love participation. Read. Raise your hand. Ask questions. Make comments. Challenge us. Acknowledge us. If we speak for three hours to a silent classroom, it is a lot more boring and tiring for everyone.

Laptop and phone use

You must bring your laptop to class to work on labs. No phone use is allowed during lecture. You may use your laptop during lecture to take notes, but please refrain from other activities. We reserve the right to ask you to put your phones and laptops away. You may not use your computer or phone while your peers or guest speakers are presenting.

Communication and Slack Rules

  • All announcements will be posted on Canvas and Slack
  • Use Slack for any question you may have about the course, about assignments or any technical issue. This way everyone can learn from each other’s questions. We will be monitoring and providing answers on a regular basis. Make sure you understand what is allowed in Slack.
  • Individual emails containing any course question that is not personal will not be answered
  • Slack DMs are not to be used unless we DM you first, in which case you may respond to our message. Students may not initiate DMs.
  • Keep an eye on the questions posted in Slack. Use the search function. It’s very possible that we have already answered a question, and we reserve the right to point you to the syllabus, previous Slack messages, or another document containing the information requested
  • Assignment, lab and project questions will only be answered on Slack up to 12 hours before something is due

Open Door Policy

Please approach or get in touch with us if something is not working for you regarding the class, methods, etc. Our pledge to you is to provide the best learning experience possible. If you have any issue please do not wait until the last minute to speak with us. You will find that we are fair, reasonable, and flexible and we care deeply about your learning and success.

Academic Integrity

As a Jesuit, Catholic university, committed to the education of the whole person, Georgetown expects all members of the academic community, students and faculty, to strive for excellence in scholarship and in character. The University spells out the specific minimum standards for academic integrity in its Honor Code, as well as the procedures to be followed if academic dishonesty is suspected.

Over and above the honor code, in this course we will seek to create an engaged and passionate learning environment, characterized by respect and courtesy in both our discourse and our ways of paying attention to one another.

The code of academic integrity applies to all courses at Georgetown University. Please become familiar with the code. All students are expected to maintain the highest level of academic integrity throughout the course of the semester. Please note that acts of academic dishonesty during the course will be prosecuted and harsh penalties may be sought for such acts. Students are responsible for knowing what acts constitute academic dishonesty. The code may be found at https://bulletin.georgetown.edu/regulations/honor/.

Caution

We have a ZERO TOLERANCE POLICY and students found to be in violation will be reported and penalized. The consequences of any violation may include: additional points penalty, getting a grade of zero, automatically failing the course, and suspension or expulsion from the program.

Definition of collaboration

In the spirit of fostering a collective and inclusive learning environment, we acknowledge that you will work and study with your peers. We also acknowledge that you use web resources (code examples specifically), and that in writing a program many of you will most likely use the same libraries, functions and other similar instructions in your scripts. However:

  • You must write your own code. This will be verified for every assignment against every submission, and any similarity greater than 60% between students on a given assignment will be considered to be unauthorized collaboration.
  • You must do your individual work in your own cloud resources. This will be verified for every assignment. We know the fingerprint of your cloud account and subscriptions and we can tell.

What is allowed

  • Collaborating with other students during in-class labs to facilitate collective learning
  • Using Slack for helping one-another as long as:
    • You do not provide answers directly but only discuss potential approaches
    • You only share up to a few lines of code for everyone’s benefit for the resolution of a specific question or issue
  • Using anything (code, resources, tips, approaches, etc.) provided by the instructional team

What is forbidden

The following actions are not permitted in any way and are considered a violation of academic integrity:

  • Copying and sharing code between students in individual assignments or across groups in the group project
  • Sharing anything on any individual assignment
  • Using code snippets found online (Stack Overflow, etc.) without citing the source in a comment
  • Plagiarism of any kind
  • Using any Generative Artificial Intelligence tool without acknowledging it
  • Using someone else’s cloud resources
  • Making your private GitHub repos public
  • Sharing or posting any course materials anywhere
  • Faking or tampering with git commit dates or messages

Use of Generative AI tools

We recognize the recent availability of very powerful generative AI (GAI) tools like ChatGPT, GitHub Copilot, and others. These tools can help us be more effective and we embrace their use.

Important

You are allowed to use GAI tools in a non-substantial way.

What does non-substantial mean?

It means that whatever is generated by GAI must not make up the majority of the work you do.

Any use of these tools must abide by the following rules:

  • You must acknowledge the use of GAI tools
  • You must comment which code blocks were generated by GAI
  • You must note which written sections were generated by GAI
  • If you used a prompt to ask the GAI tool to do something, you must include it
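
One possible way to satisfy these rules inside a code file is shown below. The comment layout (begin/end markers, tool, prompt, and a note on your own changes) is only a suggestion and an assumption on our part, since no specific format is prescribed here; the function `word_counts` and the prompt text are hypothetical examples.

```python
# --- BEGIN GAI-GENERATED BLOCK ---
# Tool: (name of the GAI tool you used)
# Prompt: "Write a Python function that counts how often each word appears
#          in a string, ignoring case."
# My changes: renamed the function and added the docstring.
def word_counts(text):
    """Count case-insensitive word occurrences in text."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts
# --- END GAI-GENERATED BLOCK ---


# Written by me, without GAI assistance.
print(word_counts("Spark spark Ray"))  # {'spark': 2, 'ray': 1}
```

The same idea applies to written sections: mark where the generated text starts and ends, and include the prompt you used.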

For this course, valid uses of gen-ai can be:

  • Generating a code snippet or single function to perform a task. It’s likely you’ll need to modify it anyway
  • Commenting code
  • Using it as a writing aid (spelling, grammar, word choice, limited phrase translation) on content created by you, not the actual writing. Note: non-native English speakers cannot use gen-ai to fully translate content written in another language.
  • Generating visualization starter code (you can accelerate the generation of the starting point, but you still need to customize the viz with all the best practices learned in this course)

Warning

Any deviation from these rules is considered a violation of academic integrity and will be acted on.

You typically KNOW when you are crossing the line into unethical territory. As a general rule, if you feel like you might be crossing a line, then you probably are!

In addition to what we are stating here, please take a look at the Data Science and Analytics Program’s ChatGPT usage guidelines.

Georgetown University resources and policies

Georgetown University’s Plagiarism Policy

Plagiarism or academic dishonesty in any form will not be tolerated and may result in a failing grade. All Honor Code violations will be submitted to the Honor Council.

Academic integrity is central to the learning and teaching process. Students are expected to conduct themselves in a manner that will contribute to the maintenance of academic integrity by making all reasonable efforts to prevent the occurrence of academic dishonesty. Academic dishonesty includes (but is not limited to) obtaining or giving aid on an examination, having unauthorized prior knowledge of an examination, doing work for another student, and plagiarism of all types, including copying code.

Plagiarism is the intentional or unintentional presentation of another person’s idea or product as one’s own. Plagiarism includes, but is not limited to, the following: copying verbatim all or part of another’s written work; using phrases, charts, figures, illustrations, code, or mathematical/scientific solutions without citing the source; paraphrasing ideas, conclusions, or research without citing the source; and using all or part of a literary plot, poem, film, musical score, or other artistic product without attributing the work to its creator. Students can avoid unintentional plagiarism by carefully following accepted scholarly practices. Notes taken for papers and research projects should accurately record the sources of material that is cited, quoted, paraphrased, or summarized, and those sources should be acknowledged in footnotes.

Honor System

All students are expected to maintain the highest standards of academic and personal integrity in pursuit of their education at Georgetown. Academic dishonesty, including plagiarism, in any form, is a serious offense, and students found in violation are subject to academic penalties that include, but are not limited to, failure of the course, termination from the program, and revocation of degrees already conferred. All students are held to the Georgetown University Honor Code. For more information about the Honor Code, see http://gervaseprograms.georgetown.edu/honor/

Academic Resource Center

The Academic Resource Center (ARC) is the campus office responsible for reviewing medical documentation and determining reasonable accommodations for students with disabilities. You can reach the ARC via email at arc@georgetown.edu.

Counseling and Psychiatric Services (CAPS)

As Georgetown faculty, you are among the most important individuals in some of the students’ lives. They may turn to you when they are struggling and in times of need, or you may be one of the first to notice when they are distressed.

The CAPS website has tips for faculty on how to deal with struggling or distressed students. To reach CAPS, call 202.687.6985 or, after hours, call (833) 960-3006 to reach Fonemed, a telehealth service; individuals may ask for the on-call CAPS clinician.

Emergency Preparedness and HOYAlert

We encourage all faculty to become familiar with Georgetown’s Office of Emergency Management and sign up for HOYAlert to receive important safety and University operating status updates. Faculty teaching at the Georgetown Downtown campus might also want to sign up for AlertDC to obtain safety and traffic updates.

Office of Institutional Compliance and Ethics

The Office of Institutional Compliance and Ethics supports and coordinates many compliance-related activities the University undertakes. With the endorsement and assistance of the University’s senior leadership, this Office is responsible for leading the development, implementation, and operation of the Georgetown Institutional Compliance and Ethics Program.

Office of Institutional Diversity, Equity and Affirmative Action (IDEAA)

The mission of IDEAA is to promote a deep understanding and appreciation among the diverse members of the University community to result in justice and equality in educational, employment, and contracting opportunities, as well as to lead efforts to create an inclusive academic and work environment.

Title IX/Sexual Misconduct

Georgetown University and its faculty are committed to supporting survivors and those impacted by sexual misconduct, which includes sexual assault, sexual harassment, relationship violence, and stalking. Georgetown requires faculty members, unless otherwise designated as confidential, to report all disclosures of sexual misconduct to the University Title IX Coordinator or a Deputy Title IX Coordinator. If you disclose an incident of sexual misconduct to a professor in or outside of the classroom (except disclosures in papers), that faculty member must report the incident to the Title IX Coordinator or Deputy Title IX Coordinator. The coordinator will, in turn, reach out to the student to provide support, resources, and the option to meet. Please note that the student is not required to meet with the Title IX coordinator. More information about reporting options and resources can be found on the Sexual Misconduct Website.

If you would prefer to speak to someone confidentially, Georgetown has a number of fully confidential professional resources that can provide support and assistance. These resources include:

  • Health Education Services for Sexual Assault Response and Prevention: confidential email sarp@georgetown.edu
  • Counseling and Psychiatric Services (CAPS): 202.687.6985 or after hours, call (833) 960-3006 to reach Fonemed, a telehealth service; individuals may ask for the on-call CAPS clinician

Title IX Sexual Misconduct Statement

Please know that as faculty members, we are committed to supporting survivors of sexual misconduct, including relationship violence and sexual assault. However, university policy also requires us to report any disclosures about sexual misconduct to the Title IX Coordinator, whose role is to coordinate the University’s response to sexual misconduct.

Georgetown has a number of fully confidential professional resources who can provide support and assistance to survivors of sexual assault and other forms of sexual misconduct. These resources include:

  • Getting Help
  • Jen Schweer, MA, LPC
    Associate Director of Health Education Services for Sexual Assault Response and Prevention
    (202) 687-032
    jls242@georgetown.edu
  • Erica Shirley, Trauma Specialist
    Counseling and Psychiatric Services (CAPS)
    (202) 687-6985
    els54@georgetown.edu

Threat Assessment

Georgetown University established its Threat Assessment program as part of an extensive emergency planning initiative. The program at Georgetown has been developed and implemented to meet current best practices and national standards for hazard planning in higher education institutions and workplace violence prevention.

Special Accommodations

If you believe that you have a disability that will affect your performance in this class, don’t hesitate to get in touch with the Academic Resource Center for further information. The center is located in the Leavey Center, Suite 335. The Academic Resource Center is the campus office responsible for reviewing documentation provided by students with disabilities and determining reasonable accommodations according to the Americans with Disabilities Act (ADA) and University policies.