Syllabus
Instructors
TAs
Communication
Course description
Data is everywhere! In today’s data-driven world, you will often find yourself with a dataset that is just too big to be analyzed with traditional programming libraries on your laptop or workstation. That is where modern open source projects, cloud providers, and distributed computation/processing save the day.
This is a practical, workshop-style course about using cloud computing to analyze and manipulate datasets that are too large to fit on a single machine or to be analyzed with traditional tools. You will have the opportunity to use the tools and techniques discussed in class. Although this is not a programming course per se, there is programming involved.
You will understand how to ingest the data, and then massage, clean, transform, analyze, and model it within the context of big data analytics. You will be able to think more programmatically and logically about your big data needs, tools, and issues.
Learning objectives
- Select, configure, and use the appropriate tools and cloud infrastructure to work with datasets
- Process and analyze large datasets using scalable approaches
- Learn the services offered by Microsoft Azure and Amazon Web Services
- Use big data processing skills, including git and the Linux command line
- Execute a big data analytics project from start to finish: process, analyze, model, and communicate results through written and verbal methods
- Understand the steps required to scale from interactive scripting to unattended jobs
Pre-requisites
- Experience with Python and SQL. Note: The primary language is Python
- Experience with git and GitHub
Some tutorials to brush up on these skills:
- git - the simple guide
- Nico Riedmann’s Learn git concepts, not commands
- SQLBolt - Learn SQL with simple, interactive exercises
- The first six videos listed in The Missing Semester of your CS Education
- The DSAN Bootcamp materials
Required resources
Computer
You should have a laptop (no Chromebooks, please). Windows, Mac or Linux machines are acceptable. Please bring your machine to class.
Cloud accounts
We will discuss how to set up your account(s) and environment(s) in class and lab within the first couple of weeks. You will get credits that will be enough to support your coursework throughout the semester.
Learning activities
Class format
The course meetings follow a split lecture/lab format. All class meetings will have a lecture portion, and most sessions will have an in-class lab portion.
During the lecture portion, we will discuss concepts, techniques, cloud services, and open-source tools, and explore the tools’ history and development.
During the lab portion, we will usually perform a short demonstration, and then you will complete exercises and follow examples designed to show you how to implement the ideas and concepts with various tools.
Readings
On certain weeks, readings will be assigned to prepare you for the lecture material being presented. These readings should take an hour or less per week. Reading materials will be provided in PDF format via Canvas.
Online Quizzes
There will be unannounced quizzes a few times during the semester, at random intervals and times. The quizzes ensure you are keeping up with the material presented in the class. The material for the quizzes will be drawn from lectures, labs, and readings.
Lab completions
Most labs will have a deliverable. Completing the labs is essential for you to learn the skills presented in class.
The lab deliverables can sometimes be completed during lab time; however, it is your responsibility to complete the deliverable as part of your work outside of lecture/lab time.
Homework assignments
There will be several homework assignments. The goal of these problem sets is to hone your big data skills by answering some questions about large datasets. The problem sets will build on the labs and will be much more in-depth. Deliverables from the assignment will usually include code written for your programs and the output produced.
Big Data analytics project
You will assemble into groups of 3 to 4 students in any section. You will perform and write up an analysis of a big dataset using the tools learned in class. Big is defined as “a dataset that is so large that you cannot work with it on a laptop.”
The details for the project will be provided within the first few weeks of the term.
Evaluation
- Assignments : 30%
- Lab completions : 20%
- Quizzes & attendance : 10%
- Group project : 40%
Total is 100%. There is no plan to curve the final grade, and the final letter grade will be:
- A: >= 92.5
- A-: >= 89.5, < 92.5
- B+: >= 87.99, < 89.5
- B: >= 81.5, < 87.99
- B-: >= 79.5, < 81.5
- C: >= 70, < 79.5
- F: < 70
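As an illustration of the arithmetic above, here is a minimal Python sketch that combines the four component weights into a total and maps that total to a letter. It assumes each component is scored on a 0 to 100 scale; the function and dictionary names are hypothetical and not part of any official course tooling.

```python
# Component weights in percent, taken from the Evaluation list.
WEIGHTS = {
    "assignments": 30,
    "labs": 20,
    "quizzes_attendance": 10,
    "project": 40,
}

# (lower bound, letter) pairs from the grade scale, highest first.
CUTOFFS = [
    (92.5, "A"),
    (89.5, "A-"),
    (87.99, "B+"),
    (81.5, "B"),
    (79.5, "B-"),
    (70.0, "C"),
]


def weighted_total(scores):
    """Combine component scores (each assumed to be 0-100) into a 0-100 total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total):
    """Return the first letter whose lower bound the total reaches, else F."""
    for lower_bound, letter in CUTOFFS:
        if total >= lower_bound:
            return letter
    return "F"


example = {"assignments": 90, "labs": 100, "quizzes_attendance": 80, "project": 95}
print(weighted_total(example), letter_grade(weighted_total(example)))
```

With the example scores, the total is 93.0, which maps to an A.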
Grading philosophy
Some of the assignments you will work on are open-ended and some are not (i.e., specific tasks). Grading is generally holistic, meaning that there may not always be a specific point value for individual elements of a deliverable. Each deliverable submission is unique and will be compared to all other submissions.
Deliverables that:
- Exceed the requirements and expectations are typically considered A level work.
- Just meet the requirements and expectations are typically considered A-/B+ level work.
- Do not meet the requirements are typically considered B or lesser level work.
Partial credit will be given where appropriate.
All deliverables must meet general quality requirements that are expected from students at the graduate school level as well as specific requirements that will be provided for each deliverable. Points will be deducted for any of the following reasons:
- You did not follow any direct and specific instructions
- Your deliverable has missing sections
- Your overall presentation and/or writing is sloppy
- Your code does not follow best coding practices
- Your code has no comments (including the areas where GAI was used)
- Your repository has either more or fewer files than those requested
- You use absolute references (file paths, URLs, etc.) in your scripts
- You alter the repository structure in any way
- You do not use GitHub Classroom
- You do not use git effectively
- You manually upload files to GitHub through the web and do not use git
- You use incorrect file names (wrong extensions, wrong case, etc.)
- Your technical approach is fundamentally flawed
- Your analytical decisions are unjustified
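One of the deductions above concerns absolute references in scripts. As a minimal sketch of one way to avoid hard-coded absolute paths in Python, a script can build paths from its own location. The helper name and the data/input.csv layout are hypothetical, used only for illustration; individual deliverables may specify a different layout.

```python
from pathlib import Path


def path_next_to_script(script_file, *parts):
    """Build a path relative to the folder that contains the given script.

    Nothing machine-specific is hard-coded: the location is derived at run
    time from the script's own path.
    """
    return Path(script_file).resolve().parent.joinpath(*parts)


# Inside a real script you would pass __file__, for example:
#   input_path = path_next_to_script(__file__, "data", "input.csv")
demo = path_next_to_script("/tmp/repo/analysis.py", "data", "input.csv")
print(demo.name)
```

Because the path is derived from the script's location, the same code works no matter where the repository is cloned.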
Submitting your work
GitHub Classroom
We use GitHub Classroom for all class deliverables: assignments, labs, and the final project. Submitting your work is the process of committing your files and results to your local repository and then pushing them to GitHub.
Use the final-submission commit message
When you are ready for your work to be evaluated, you MUST use the commit message final-submission. If you do not use the commit message final-submission, we will assume that you are still working in the repository and we will only grade what is present. By submitting that commit message, you are stating that you are finished with the assignment and are ready for feedback.
In case you need to make a correction after your final-submission and the submission deadline has not yet passed, you can amend your previous commit. See amending a commit for instructions. Do not change the commit message; it should continue to say “final-submission” after the amend.
Late policy
In lieu of extensions, there is a tiered deduction scale if a deliverable is late. Late penalties only apply to labs and assignments.
We will assess exceptional circumstances on a case-by-case basis, and only if we are made aware before a deliverable’s deadline, not after.
- A late penalty of 10% per day, up to 4 days, will be assessed for assignments and labs that are submitted with a final-submission commit message after the deadline. You may still submit a missed lab or assignment up until the last day of class (May 2) with a maximum possible grade of 60%.
- Missed in-class quizzes cannot be made up and will receive a grade of zero.
- Project deadlines are fixed and have no extensions or late penalty. A missed project deliverable will receive a grade of zero.
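The tiered deduction for labs and assignments can be sketched in Python as follows. This is only an illustration under stated assumptions: the 10% per day is treated as 10 percentage points subtracted from the earned score, lateness is counted in whole days, and the deduction stops growing after 4 days (which is what leaves a maximum possible grade of 60%). The function name is hypothetical, and the instructional team's actual calculation may differ.

```python
def late_adjusted_score(raw_score, days_late):
    """Apply the assumed late deduction to a lab or assignment score (0-100).

    Assumption: 10 percentage points are subtracted per whole day late,
    capped at 4 days, so a perfect submission more than 4 days late
    still tops out at 60.
    """
    if days_late <= 0:
        return raw_score  # on time: no deduction
    penalty_days = min(days_late, 4)  # deduction stops growing after 4 days
    return max(0, raw_score - 10 * penalty_days)


print(late_adjusted_score(100, 2))
print(late_adjusted_score(100, 10))
```

Under these assumptions, a perfect submission two days late scores 80, and one ten days late scores 60.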
Other course policies
Attendance and punctuality
Attendance is mandatory and will be taken. Given the technical nature of this course, and the breadth of topics discussed, you are expected to attend each class, to complete all readings, and to participate actively in lectures, discussions and exercises. We understand there may be times when you need to miss class; please inform us in advance if you are not able to attend class for any reason. However, it is up to you to keep up.
Participation
We love participation. Read. Raise your hand. Ask questions. Make comments. Challenge us. Acknowledge us. If we speak for three hours to a silent classroom, it is a lot more boring and tiring for everyone.
Laptop and phone use
You must bring your laptop to class to work on labs. No phone use is allowed during lecture. You may use your laptop during lecture to take notes, but please refrain from other activities. We reserve the right to ask you to put your phones and laptops away. You may not use your computer or phone while your peers or guest speakers are presenting.
Communication and Slack Rules
- All announcements will be posted on Canvas and Slack
- Use Slack for any question you may have about the course, about assignments or any technical issue. This way everyone can learn from each other’s questions. We will be monitoring and providing answers on a regular basis. Make sure you understand what is allowed in Slack.
- Individual emails containing any course question that is not personal will not be answered
- Slack DMs are not to be used unless we DM you first and you can respond to our message. Students may not initiate DMs.
- Keep an eye on the questions posted in Slack. Use the search function. It’s very possible that we have already answered a question, and we reserve the right to point you to the syllabus, previous Slack messages, or another document containing the information requested
- Assignment, lab and project questions will only be answered on Slack up to 12 hours before something is due
Open Door Policy
Please approach or get in touch with us if something is not working for you regarding the class, methods, etc. Our pledge to you is to provide the best learning experience possible. If you have any issue please do not wait until the last minute to speak with us. You will find that we are fair, reasonable, and flexible and we care deeply about your learning and success.
Academic Integrity
As a Jesuit, Catholic university, committed to the education of the whole person, Georgetown expects all members of the academic community, students and faculty, to strive for excellence in scholarship and in character. The University spells out the specific minimum standards for academic integrity in its Honor Code, as well as the procedures to be followed if academic dishonesty is suspected.
Over and above the honor code, in this course we will seek to create an engaged and passionate learning environment, characterized by respect and courtesy in both our discourse and our ways of paying attention to one another.
The code of academic integrity applies to all courses at Georgetown University. Please become familiar with the code. All students are expected to maintain the highest level of academic integrity throughout the course of the semester. Please note that acts of academic dishonesty during the course will be prosecuted and harsh penalties may be sought for such acts. Students are responsible for knowing what acts constitute academic dishonesty. The code may be found at https://bulletin.georgetown.edu/regulations/honor/.
Definition of collaboration
In the spirit of fostering a collective and inclusive learning environment, we acknowledge that you will work and study with your peers. We also acknowledge that you use web resources (code examples specifically), and that in writing a program many of you will most likely use the same libraries, functions and other similar instructions in your scripts. However:
- You must write your own code. This will be verified for every assignment against every submission, and any similarity greater than 60% between students on a given assignment will be considered to be unauthorized collaboration.
- You must do your individual work in your own cloud resources. This will be verified for every assignment. We know the fingerprint of your cloud account and subscriptions and we can tell.
What is allowed
- Collaborating with other students during in-class labs to facilitate collective learning
- Using Slack for helping one another as long as:
  - You do not provide answers directly but only discuss potential approaches
  - You only share up to a few lines of code for everyone’s benefit for the resolution of a specific question or issue
- Using anything (code, resources, tips, approaches, etc.) provided by the instructional team
What is forbidden
The following actions are not permitted in any way and are considered a violation of academic integrity:
- Copying and sharing code between students in individual assignments or across groups in the group project
- Sharing anything on any individual assignment
- Using code snippets found online (Stack Overflow, etc.) without citing the source in a comment
- Plagiarism of any kind
- Using any Generative Artificial Intelligence tool without acknowledging it
- Using someone else’s cloud resources
- Making your private GitHub repos public
- Sharing or posting any course materials anywhere
- Faking or tampering with git commit dates or messages
Use of Generative AI tools
We recognize the recent availability of very powerful generative AI tools like ChatGPT, GitHub Copilot, and others. These tools can help us be more effective and we embrace their use.
Georgetown University resources and policies
Georgetown University’s Plagiarism Policy
Plagiarism or academic dishonesty in any form will not be tolerated and may result in a failing grade. All Honor Code violations will be submitted to the Honor Council.
Academic integrity is central to the learning and teaching process. Students are expected to conduct themselves in a manner that will contribute to the maintenance of academic integrity by making all reasonable efforts to prevent the occurrence of academic dishonesty. Academic dishonesty includes (but is not limited to) obtaining or giving aid on an examination, having unauthorized prior knowledge of an examination, doing work for another student, and plagiarism of all types, including copying code.
Plagiarism is the intentional or unintentional presentation of another person’s idea or product as one’s own. Plagiarism includes, but is not limited to, the following: copying verbatim all or part of another’s written work; using phrases, charts, figures, illustrations, code, or mathematical/scientific solutions without citing the source; paraphrasing ideas, conclusions, or research without citing the source; and using all or part of a literary plot, poem, film, musical score, or other artistic product without attributing the work to its creator. Students can avoid unintentional plagiarism by carefully following accepted scholarly practices. Notes taken for papers and research projects should accurately record the sources of material cited, quoted, paraphrased, or summarized, and those sources should be acknowledged in footnotes.
Honor System
All students are expected to maintain the highest standards of academic and personal integrity in pursuit of their education at Georgetown. Academic dishonesty, including plagiarism, in any form, is a serious offense, and students found in violation are subject to academic penalties that include, but are not limited to, failure of the course, termination from the program, and revocation of degrees already conferred. All students are held to the Georgetown University Honor Code. For more information about the Honor Code, see http://gervaseprograms.georgetown.edu/honor/
Academic Integrity and Courtesy
As a Jesuit, Catholic university committed to the education of the whole person, Georgetown expects all members of the academic community, students and faculty, to strive for excellence in scholarship and in character. The University spells out the specific minimum standards for academic integrity in its Honor Code and the procedures to be followed if academic dishonesty is suspected. Over and above the honor code, in this course, we will seek to create an engaged and passionate learning environment characterized by respect and courtesy in both our discourse and our ways of paying attention to one another.
Academic Resource Center
The Academic Resource Center (ARC) is the campus office responsible for reviewing medical documentation and determining reasonable accommodations for students with disabilities. You can reach the ARC via email at arc@georgetown.edu.
Counseling and Psychiatric Services (CAPS)
As Georgetown faculty, you are among the most important individuals in some of the students’ lives. They may turn to you when they are struggling and in times of need, or you may be one of the first to notice when they are distressed.
The CAPS website has tips for faculty on how to deal with struggling or distressed students. CAPS can be reached at 202.687.6985; after hours, call (833) 960-3006 to reach Fonemed, a telehealth service, where individuals may ask for the on-call CAPS clinician.
Emergency Preparedness and HOYAlert
We encourage all faculty to become familiar with Georgetown’s Office of Emergency Management and sign up for HOYAlert to receive important safety and University operating status updates. Faculty teaching at the Georgetown Downtown campus might also want to sign up for AlertDC to obtain safety and traffic updates.
Office of Institutional Compliance and Ethics
The Office of Institutional Compliance and Ethics supports and coordinates many compliance-related activities the University undertakes. With the endorsement and assistance of the University’s senior leadership, this Office is responsible for leading the development, implementation, and operation of the Georgetown Institutional Compliance and Ethics Program.
Office of Institutional Diversity, Equity and Affirmative Action (IDEAA)
The mission of IDEAA is to promote a deep understanding and appreciation among the diverse members of the University community to result in justice and equality in educational, employment, and contracting opportunities, as well as to lead efforts to create an inclusive academic and work environment.
Title IX/Sexual Misconduct
Georgetown University and its faculty are committed to supporting survivors and those impacted by sexual misconduct, which includes sexual assault, sexual harassment, relationship violence, and stalking. Georgetown requires faculty members, unless otherwise designated as confidential, to report all disclosures of sexual misconduct to the University Title IX Coordinator or a Deputy Title IX Coordinator. If you disclose an incident of sexual misconduct to a professor in or outside of the classroom (except disclosures in papers), that faculty member must report the incident to the Title IX Coordinator or Deputy Title IX Coordinator. The coordinator will, in turn, reach out to the student to provide support, resources, and the option to meet. Please note that the student is not required to meet with the Title IX Coordinator. More information about reporting options and resources can be found on the Sexual Misconduct Website.
If you would prefer to speak to someone confidentially, Georgetown has a number of fully confidential professional resources that can provide support and assistance. These resources include:
- Health Education Services for Sexual Assault Response and Prevention: confidential email sarp@georgetown.edu
- Counseling and Psychiatric Services (CAPS): 202.687.6985 or after hours, call (833) 960-3006 to reach Fonemed, a telehealth service; individuals may ask for the on-call CAPS clinician
Title IX Sexual Misconduct Statement
Please know that as faculty members, we are committed to supporting survivors of sexual misconduct, including relationship violence and sexual assault. However, university policy also requires us to report any disclosures about sexual misconduct to the Title IX Coordinator, whose role is to coordinate the University’s response to sexual misconduct.
Georgetown has a number of fully confidential professional resources who can provide support and assistance to survivors of sexual assault and other forms of sexual misconduct. These resources include:
- Getting Help
- Jen Schweer, MA, LPC, Associate Director of Health Education Services for Sexual Assault Response and Prevention, (202) 687-032, jls242@georgetown.edu
- Erica Shirley, Trauma Specialist, Counseling and Psychiatric Services (CAPS), (202) 687-6985, els54@georgetown.edu
Threat Assessment
Georgetown University established its Threat Assessment program as part of an extensive emergency planning initiative. The program at Georgetown has been developed and implemented to meet current best practices and national standards for hazard planning in higher education institutions and workplace violence prevention.
Special Accommodations
If you believe that you have a disability that will affect your performance in this class, don’t hesitate to get in touch with the Academic Resource Center for further information. The center is located in the Leavey Center, Suite 335. The Academic Resource Center is the campus office responsible for reviewing documentation provided by students with disabilities and determining reasonable accommodations according to the Americans with Disabilities Act (ADA) and University policies.