Milestone 1: Frame your analysis and EDA

Published Monday Oct 30, 2023 at 8:10 am

You should thoroughly read through the entire assignment before beginning your work! Don’t start costly cloud resources until you are ready.

Code and data saved locally on the cluster will be lost when the cluster terminates. If you want to keep data, you must store it in a blob storage location. You must store your results (i.e., any plots or CSV files) in the data folder of this repo.

First time setup

THIS SECTION IS IN PROGRESS AS INFRASTRUCTURE IS BEING FINALIZED

If all of the following steps are successful then it means your environment is set up correctly and you are all set to begin your project.

  1. Set up your git credentials.

  2. Create a shared cluster for your team.

  3. Clone this repo.

  4. Create a notebook, call it project_eda.

  5. Follow the example notebook steps to interact with the data in your preferred computing environment.

  6. Check in the changed files (this would be the project_eda notebook) to the repo.

  7. Export the project_eda notebook to IPython Notebook and check in the project_eda.ipynb file in the repo.

Submission Details

In this assignment, you will start working with your Reddit data, decide on the scope of your project, and outline your 10 business goals.

You will use the file project_starter_script.py to read your Reddit data from Azure Blob Storage; note that you do not need to copy the data to your local environment. The project_starter_script.py script also contains sample code for a basic EDA example and saves the results in the local repo so that they can be checked in. It also includes an example of saving intermediate big data.

You will develop EDA notebook(s) using PySpark. All your big data analysis must be in PySpark. In this assignment, you will examine the dataset, make transformations of the data, produce summary statistics and graphs of the data, and answer some of your business goals that only require exploratory work. You may choose to put all your work into one notebook or you may choose to separate it. Either is fine!

Output data sizes

REMEMBER!!! All the output you produce MUST be small data. Can you make a graph of all 1 million+ rows of Spark data? NO! You MUST collapse that big data into a reasonable size before putting it into a table or a graph. You should never collect more than ~10,000 rows back to the driver.

Minimum requirements:

  1. Make a project plan for your Reddit data with 10 topics that you plan on exploring.

    Propose 10 different avenues of analysis for your data.

    Any good data science project can be broken into at least 10 topics. These topics should vary in complexity to include exploratory (2-3), NLP (3-5), and ML (3-5) ideas. Each entry of your 10 must include the “business goal” as well as the “technical proposal” for finding the answers. We want to see the “Executive Summary” view of the questions as well as the “Data Science” plans for making it happen. The business goal cannot have any technical language and must be accessible to an audience without a data science background.

    Example question based on the data science subreddit https://www.reddit.com/r/datascience/:

    Business goal: Determine the most popular programming languages and the most effective programming languages used to conduct geospatial data analysis.

    Technical proposal: Use NLP to identify posts that mention geospatial terms and one or more programming languages. Conduct counts of which programming languages are mentioned the most along with these geospatial terms. Analyze counts over time to check for major deviations and identify the leaders. Conduct sentiment analysis of the posts to assign positive or negative values to programming languages. Present findings for volume metrics and sentiment analysis for the top 5 programming languages to answer the “popular” and “effective” insights for geospatial analysis.

    Each business goal must be 1-2 sentences while each technical proposal must be at least 3 sentences. There must be enough details about your plans so you can get feedback. Include these business requirements in a markdown cell in the project_eda notebook.

  2. Conduct your exploratory data analysis.

  • Report on the basic info about your dataset. What are the interesting columns? What is the schema? How many rows do you have? etc. etc.

  • Conduct basic data quality checks! Check for missing values, check the length of the comments, and remove rows that might be corrupted. Even if you think all your data is perfect, you still need to demonstrate that with your analysis.

  • Produce at least 5 interesting graphs about your dataset. Think about the dimensions that are interesting for your Reddit data! There are millions of choices. Make sure your graphs are connected to your business questions.

  • Produce at least 3 interesting summary tables about your dataset. You can decide how to split up your data into categories, time slices, etc. There are infinite ways you can make summary statistics. Be unique, creative, and interesting!

  • Use data transformations to make AT LEAST 3 new variables that are relevant to your business questions. We cannot be more specific because this depends on your project and what you want to explore!

  • Implement regex searches for specific keywords of interest to produce dummy variables, and then compute statistics related to your business questions. Note that you DO NOT have to do textual cleaning of the data at this point; the next assignment on NLP will focus on the textual cleaning and analysis aspect.

  • Find some type of external data to join onto your Reddit data. Don’t know what to pick? Consider a time-related dataset. Stock prices, game details over time, active users on a platform, sports scores, covid cases, etc., etc. While you may not need to join this external data with your entire dataset, you must have at least one analysis that connects to external data. You do not have to join the external data and analyze it yet, just find it.

  • If you are planning to make any custom datasets that are derived from your Reddit data, make them now. These datasets might be graph-focused, or maybe they are time series focused, it is completely up to you!

  3. Create a website using your project_eda notebook. This could be as simple as publishing your notebook on GitHub pages.

  • Here is a simple example that describes how to publish a notebook to GitHub.
  • When you publish to GitHub, it will warn you that the repo is private but the published website will be public; that is OK.

We expect you to put significant effort into this assignment. This assignment only requires the Jupyter notebooks.

Submitting the Assignment

You will follow the submission process for all labs and assignments:

  1. Commit and push your files
  2. Export any of the notebooks you created as IPython Notebook and/or HTML files, and add them to your repository from your local machine before you tag the release
  3. Tag the submission release commit with the v0.1-eda tag by the due date
  4. Make sure to push to GitHub
  5. SPECIFIC TO THIS DELIVERABLE - submit the URL for your public-facing website to the assignment on Canvas.

Make sure you commit only the files requested, and push your repository to GitHub!

The files to be committed and pushed to the repository for this assignment are:

  • README.md
  • .gitignore
  • LICENSE
  • code/project_starter_script.py
  • code/project_eda.ipynb
  • img/*
  • data/*

Make sure that your project_eda notebook includes both a list of the business problems you are solving and the charts and summary tables as described above in the minimum requirements section.

Grading Rubric

Many of the assignments you will work on are open-ended. Grading is generally holistic, meaning that there will not always be a specific point value for individual elements of a deliverable. Each deliverable submission is unique and will be compared to all other submissions.

  • If a deliverable exceeds the requirements and expectations, that is considered A level work.
  • If a deliverable just meets the requirements and expectations, that is considered A-/B+ level work.
  • If a deliverable does not meet the requirements, that is considered B or lesser level work.

All deliverables must meet the following general requirements, in addition to the specific requirements of each deliverable:

If your submission meets or exceeds the requirements, is creative, is well thought out, has proper presentation and grammar, and is at the graduate student level, then the submission will get full credit. Otherwise, partial credit will be given, and points will be deducted for any of the following reasons:

  • Any instruction is not followed
  • There are missing sections of the deliverable
  • The overall presentation and/or writing is sloppy
  • There are no comments in your code
  • There are files in the repository other than those requested
  • There are absolute filename links in your code
  • The repository structure is altered in any way
  • Files are named incorrectly (wrong extensions, wrong case, etc.)