Dataset and topic proposal milestone
Dataset and Topic Requirements
- The project topic is up to you and you can use any publicly available dataset(s). Pick a topic that interests you.
- Our expectation is that you will use multiple datasets that you can join or layer in a way that makes sense, adds richness, and enables a variety of visualization types.
- No proprietary datasets are allowed. All data must be available and accessible from public sources. Data behind a login is acceptable as long as anyone else can access the data (with their own login). References to all datasets used must be provided with links.
- Different groups may use the same dataset(s) with the condition that each group creates a unique narrative with analyses and visualizations. In the case that we receive two very similar proposals, preference will be given to the group that submits earlier.
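As an illustration of what "join" means in the requirement above, here is a minimal sketch in Python using only the standard library. The two tables, their column names, and every value in them are made up for the example; real datasets would be read from files, and a library such as pandas would be a reasonable alternative.

```python
import csv
import io

# Made-up sample tables standing in for two public datasets.
# Region names and numbers are placeholders, not real data.
EVENTS_CSV = """region,year,events
A,2018,12
B,2017,7
C,2018,3
"""

POPULATION_CSV = """region,population
A,1000
B,2500
D,400
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left, right, key):
    """Join two lists of row dictionaries on a shared key column.

    Rows whose key appears in only one table are dropped (inner join).
    Assumes the key is unique in the right-hand table.
    """
    lookup = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            # Merge the columns of both rows into one combined record.
            joined.append({**row, **match})
    return joined


events = read_rows(EVENTS_CSV)
population = read_rows(POPULATION_CSV)
combined = inner_join(events, population, "region")

for row in combined:
    print(row["region"], row["year"], row["events"], row["population"])
```

Regions C and D drop out because each appears in only one table. Checking how many rows survive a join is a quick way to see whether two datasets really share a usable key.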
Use rich datasets
Ideally, your dataset or combination of datasets should include all of the following data types:
- Both qualitative and quantitative data
- A time element
- A geospatial element
- A text element
- A relationship element (the ability to be transformed into a graph/network)
We recognize that having all of these can be a challenge or may not make sense for the context of your story. However, please attempt to do so.
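To make the relationship element more concrete, the sketch below shows one possible way to turn tabular records into a weighted edge list: two entities are linked whenever they appear in the same record. The records, the `entities` field, and the function name are hypothetical and chosen only for illustration; your own data may suggest a different definition of an edge.

```python
from collections import defaultdict
from itertools import combinations

# Made-up records: each one lists the entities that appear together
# (for example, co-authors of a paper or tags on an article).
records = [
    {"id": "r1", "entities": ["A", "B", "C"]},
    {"id": "r2", "entities": ["A", "B"]},
    {"id": "r3", "entities": ["C", "D"]},
]


def build_edge_weights(rows):
    """Count how often each pair of entities shares a record.

    Returns a dict mapping an (a, b) tuple, with a < b, to the number
    of records that contain both entities. The graph is undirected.
    """
    weights = defaultdict(int)
    for row in rows:
        # sorted(set(...)) removes duplicates and fixes the pair order.
        for a, b in combinations(sorted(set(row["entities"])), 2):
            weights[(a, b)] += 1
    return dict(weights)


edges = build_edge_weights(records)

for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

The resulting pairs and weights can be passed to a network visualization tool as an edge list.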
There are many sources of available datasets. Please search and think beyond the obvious places. Here are some suggestions:
- https://github.com/BuzzFeedNews/everything
- https://www.reddit.com/r/datasets
- https://github.com/awesomedata/awesome-public-datasets
- https://www.ehdp.com/links/datasets.html
- https://www.quora.com/Where-can-I-find-large-datasets-open-to-the-public
- https://cloud.google.com/bigquery/public-data
- https://www.data-is-plural.com/
Not allowed
You may not use any of the following datasets:
- Kaggle datasets. Kaggle datasets are primarily designed for machine learning and modeling. While you can build visualizations from these datasets, there are often not enough data points and/or variables, or the data itself is masked or obfuscated.
- New York City Taxicab
- Airline Delays
- Amazon or Yelp Reviews
- Iris dataset
- Penguins dataset
- Any datasets used in labs, lectures, or homework assignments
- Any COVID related dataset
- Any proprietary or paid dataset
Submission
Please coordinate amongst your teammates and submit only one form per team. No changes are permitted after you submit. You must be logged into GU Google and the respondent’s email address will be logged. If there is more than one submission per team, only the earliest submission will be considered.
Please fill out this Google form with the following information:
- Your team number as defined in Canvas
- What dataset(s) are you planning to use? Provide a brief description of the dataset(s). You must provide source URLs so we can take a look. Make sure that the URLs for your data sources are correct and functional.
- Why do you want to use this data? Please provide reasons why you are interested in this particular data, not what or how you are going to use it.
- Acceptable response: We would like to use wildfire data because wildfires have been a hot topic in the news lately, especially in areas of the United States experiencing droughts, such as California.
- Not-acceptable response: We would like to use wildfire data because we would like to explore the topic generally to produce interesting findings, but also to see if there are any larger implications of wildfires and their relationship to climate change.
- What do you wish to explore? Provide some preliminary ideas on what questions you would like to answer and how you intend to go about it.
- Do you believe your data is rich enough to allow you to successfully complete a comprehensive analysis and story? Please explain.
You may want to write these answers in a text editor before submitting the form and then copy and paste your responses.
We will reject dataset/topic proposals that:
- Do not have working URLs to the source
- Do not provide thoughtful answers
- Intend to use trivial datasets that are not rich enough to allow you to do a comprehensive project
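Since proposals without working source URLs are rejected, you may want to check your links before submitting. The following is a minimal sketch using only the Python standard library; the function names, the timeout, and the `User-Agent` string are assumptions made for this example, and it is not the procedure the course staff use to review links. Some servers refuse `HEAD` requests or require a login, so a `False` result means "check this link by hand", not "this link is broken".

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def looks_like_http_url(url):
    """Return True if the string has an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def url_is_reachable(url, timeout=10):
    """Return True if a HEAD request to the URL succeeds.

    Redirects are followed automatically. Any network error, timeout,
    or HTTP error status is reported as False.
    """
    if not looks_like_http_url(url):
        return False
    request = urllib.request.Request(
        url,
        method="HEAD",
        # Hypothetical identifier; some servers reject requests without one.
        headers={"User-Agent": "dataset-proposal-link-check"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False
```

To use it, call `url_is_reachable` on each source URL you plan to enter in the form and open any link that returns `False` in a browser to confirm it manually.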