1. How is walkability associated with socioeconomic outcomes in Washington, D.C.?
1.1 Data Collection
To investigate whether walkability has an impact on other aspects of people's lives, we compared socioeconomic outcomes across neighborhoods according to each neighborhood's walkability. To accomplish this, we looked for walkability data and socioeconomic-outcome data with neighborhood-level granularity. We were able to conduct this analysis with the following datasets:
EPA Walkability Index: This dataset compiles different features about each census tract1 and computes a walkability score for it. Some D.C. neighborhoods comprise several census tracts, depending on the neighborhood's population density.2
U.S. Census Bureau Community Resilience Estimates: This dataset compiles different socioeconomic factors for each census tract. The ones we chose to analyze were the percentage of households that fall below the poverty threshold, high school completion rates, income inequality, and vehicle ownership.
U.S. Census Tract D.C. GeoJSON: This GeoJSON map file, created by the U.S. Census Bureau for Washington, D.C., includes geographical information regarding the location of each neighborhood in Washington, D.C.
1 A census tract is a geographic region defined for the purpose of taking a census. There are 179 census tracts in Washington, D.C.
2 Census tracts generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people. A census tract usually covers a contiguous area; however, the spatial size of census tracts varies widely depending on the density of settlement.
1.2 Data Cleaning
To create a clean dataset in the format we wanted, we wrote a script that subset the raw data to D.C. in each data source and kept only the columns of interest. We then joined the files on the census tract ID column. Most of the columns were percentages, but some, like the walkability score, were reported on a different numeric range. To ensure that the visualizations would enable easy comparison between the factors, each outcome column was rescaled to a 0-100 range. We then conducted some basic data cleaning, including converting missing or negative values to NaN, renaming columns, and melting the data into long format. The JSON data with the geographic boundaries for each D.C. census tract did not require any cleaning.
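The rescaling is a standard min-max transformation. As a minimal illustrative sketch, separate from the full cleaning script below: a walkability score of 10.5 on the original 1-20 scale maps to 50.0 on the new scale.

def rescale(x, old_min=1, old_max=20, new_min=0, new_max=100):
    # min-max rescaling applied to each outcome column
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

rescale(10.5)  # -> 50.0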
Code
from IPython.display import display, HTML
import altair as alt
import pandas as pd
import geopandas as gpd
from pathlib import Path
import requests
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

"""IMPORT DATA"""
# define data directory
data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = data_dir/"cleaned_data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"

# import data
walkability = pd.read_csv(data_dir/"joined_depression_cre_walkability.csv")
walkability.loc[:, 'geoid_tract_20'] = walkability.geoid_tract_20.astype(str)
nation = pd.read_csv(data_dir/"cleaned_data"/"nation-joined_depression_cre_walkability.csv")

# # Ingest GEOJSON file of census tracts in DC and grab json
# req_dc = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')
# json_dc = req_dc.json()
# # create geopandas dataframe and add in the walkability / outcomes data
# geo_df = gpd.GeoDataFrame.from_features((json_dc))
# merged_df = geo_df.merge(walkability,
#                          how='left',
#                          left_on='GEOID',
#                          right_on='geoid_tract_20')
merged_df = walkability.copy(deep=True)

"""NORMALIZE SCORES ACROSS ALL METRICS"""
# convert the walkability score into a scale from 0 to 100 to make it easier to interpret
# original range: 1-20
# new desired range: 0-100
original_range_min = 1
original_range_max = 20
new_range_max = 100
new_range_min = 0
merged_df.loc[:, 'walkability_score_scaled'] = merged_df.loc[:, 'walkability_score'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)
nation.loc[:, 'walkability_score_scaled'] = nation.loc[:, 'walkability_score'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)

# convert the income inequality index score into a scale from 0 to 100 to make it easier to interpret
# original range: 0-1
# new desired range: 0-100
original_range_min = 0
original_range_max = 1
new_range_max = 100
new_range_min = 0
merged_df.loc[:, 'income_inequality_gini_index'] = merged_df.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: x if x >= 0 else np.nan)
merged_df.loc[:, 'income_inequality_gini_index_scaled'] = merged_df.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)
nation.loc[:, 'income_inequality_gini_index'] = nation.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: x if x >= 0 else np.nan)
nation.loc[:, 'income_inequality_gini_index_scaled'] = nation.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)

# define columns to report
outcomes_cols = ['walkability_score_scaled',
                 'below_poverty_level_perc',
                 'income_inequality_gini_index_scaled',
                 'hs_grad_perc',
                 'households_no_vehicle_perc']
# treat negative values as missing
for i in outcomes_cols:
    merged_df[i] = merged_df[i].apply(lambda x: x if x >= 0 else np.nan)
    nation[i] = nation[i].apply(lambda x: x if x >= 0 else np.nan)

# flip metric to be percent of households with a car
merged_df.loc[:, 'households_w_vehicle'] = 100 - merged_df['households_no_vehicle_perc']
nation.loc[:, 'households_w_vehicle'] = 100 - nation['households_no_vehicle_perc']

"""CLEAN COLUMN NAMES"""
col_mapping = {'below_poverty_level_perc': '% Below Poverty Level',
               'income_inequality_gini_index_scaled': 'Income Inequality Gini Score',
               'hs_grad_perc': '% HS or Higher Degree',
               'households_w_vehicle': '% with a Vehicle',
               'walkability_score_scaled': 'Walkability Score',
               'neighborhood_name': 'Neighborhood Name'}
merged_df = merged_df.rename(col_mapping, axis='columns')

"""RE-FORMAT DATA"""
# turn the dataframe into long data so that the bar chart can be created with each outcome as a bar
neighborhood_df = pd.melt(merged_df, id_vars='Neighborhood Name', value_vars=col_mapping.values())
neighborhood_df = neighborhood_df.groupby(['Neighborhood Name', 'variable'])['value'].mean().reset_index()
walk_scores = dict(zip(
    list(neighborhood_df[neighborhood_df.variable == 'Walkability Score']['Neighborhood Name']),
    list(neighborhood_df[neighborhood_df.variable == 'Walkability Score']['value'])))
neighborhood_df.loc[:, 'Walkability Score'] = neighborhood_df['Neighborhood Name'].map(walk_scores)

# reformat to get the national averages
nation = nation[outcomes_cols + ['households_w_vehicle']]
nation.drop('households_no_vehicle_perc', axis='columns', inplace=True)
nation_avg = pd.melt(nation, value_vars=[i for i in col_mapping.keys() if 'neighborhood_name' not in i])
nation_avg = nation_avg.groupby('variable')['value'].mean().reset_index()

# create cleaned column for plotting the national averages
nation_avg['National Average'] = nation_avg['variable'].map(col_mapping)

# create DC average walkability score
neighborhood_df['dc_avg_walk'] = merged_df['Walkability Score'].mean()

# add URL to the american flag icon
nation_avg['flag_url'] = 'https://upload.wikimedia.org/wikipedia/commons/d/de/Flag_of_the_United_States.png'

# write data to CSV
neighborhood_df.to_csv(write_data_dir/"neighborhood_walkability.csv", index=False)
nation_avg.to_csv(write_data_dir/"national_walkability.csv", index=False)
merged_df.to_csv(write_data_dir/"cleaned_walkability.csv", index=False)
1.3 Data Visualization
Using the Pandas, GeoPandas, and Altair libraries, we generated an interactive visualization of walkability and socioeconomic outcomes in Washington, D.C. We used Altair to create a choropleth map of walkability by census tract alongside a bar chart with outcome names on the x-axis and their mean values as the bar heights; clicking a tract on the map filters the bar chart to that neighborhood. We also aggregated the data at the national level by averaging the variable values across all tracts and stored the result in a separate dataframe, so that the national averages can be overlaid on the bars.
Code
import matplotlib as mpl

# define data directories
data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = data_dir/"cleaned_data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"

# read in cleaned data
neighborhood_df = pd.read_csv(write_data_dir/"neighborhood_walkability.csv")
nation_avg = pd.read_csv(write_data_dir/"national_walkability.csv")
merged_df = pd.read_csv(write_data_dir/"cleaned_walkability.csv")

# Ingest GEOJSON file of census tracts in DC and grab json
req_dc = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')
json_dc = req_dc.json()

# create geopandas dataframe and add in the walkability / outcomes data
geo_df = gpd.GeoDataFrame.from_features((json_dc))
merged_df.loc[:, 'geoid_tract_20'] = merged_df.geoid_tract_20.astype(str)
merged_df = geo_df.merge(merged_df, how='left', left_on='GEOID', right_on='geoid_tract_20')

"""CREATE VISUALIZATION"""
mpl.rc_params_from_file('style_sheet.mplstyle')  # load custom style sheet

# define a click on the choropleth map so that it can filter the bar chart
click = alt.selection_multi(fields=['Neighborhood Name'])

# create the choropleth map
choropleth = (alt.Chart(merged_df, title="Walkability of DC Census Tracts")
              .mark_geoshape(stroke='white')
              .transform_lookup(
                  lookup='geoid_tract_20',
                  from_=alt.LookupData(merged_df, 'geoid_tract_20',
                                       ['Walkability Score', 'Neighborhood Name']))
              .encode(
                  alt.Color('Walkability Score:Q',
                            scale=alt.Scale(scheme='redyellowblue', reverse=True),
                            title="DC Walkability"),
                  opacity=alt.condition(click, alt.value(1), alt.value(0.2)),
                  tooltip=['Neighborhood Name:N', 'Walkability Score:Q'])
              .add_selection(click))

# create the bar chart, filtered by the clicked neighborhood
bars = (alt.Chart(neighborhood_df, title='Outcomes of DC Neighborhoods')
        .mark_bar()
        .encode(
            x=alt.X('variable:N', axis=alt.Axis(labelAngle=-45)),
            color='mean(Walkability Score):Q',
            y=alt.Y('mean(value):Q', sort='x', scale=alt.Scale(domain=[0, 100])),
            tooltip=['variable:N', 'mean(value):Q'])
        .properties(width=200, height=300)
        .transform_filter(click))

# modify the axes and title labels
bars.encoding.y.title = 'Avg. Value Across All Census Tracts'
bars.encoding.x.title = 'Outcome'

# tick marks for the national average of each outcome
nation_avg_lines = (alt.Chart(nation_avg)
                    .mark_tick(
                        color="black",
                        thickness=3,
                        size=39,  # controls width of tick
                        strokeDash=[1, 2])
                    .encode(
                        x='National Average:N',
                        y='value:Q'))

# american flag icon marking each national average
nation_avg_img = (alt.Chart(nation_avg)
                  .mark_image(width=15, height=15)
                  .encode(
                      x='National Average:N',
                      y='value:Q',
                      url='flag_url',
                      tooltip=['National Average', 'value:Q']))

alt.themes.enable('vox')  # add theme

# plot the two graphs together
alt.hconcat(choropleth, (bars + nation_avg_lines + nation_avg_img))
2. How is walkability associated with health outcomes in Washington, D.C.?
2.1 Data Collection
We obtained the PLACES census tract health estimates data. This dataset contains model-based census tract-level estimates from the PLACES 2022 release. PLACES covers the entire United States at the county, place, census tract, and ZIP Code Tabulation Area levels. The estimates are provided by the CDC's Division of Population Health, Epidemiology and Surveillance Branch. The underlying data sources are the Behavioral Risk Factor Surveillance System (BRFSS) 2020/2019, Census Bureau 2010 population estimates, and American Community Survey (ACS) 2015-2019 estimates. The release's 29 measures were mapped at the census tract level using GIS systems.
2.2 Data Cleaning and Exploration
We first read in the PLACES census tract data and conducted some basic cleaning, such as selecting only data related to Washington, D.C. and renaming columns. Then, for some basic data exploration, we created an exploratory scatter plot using Plotly, in which users can examine the relationship between pairs of selected health outcomes in D.C. on the x-axis and y-axis. We used a callback function, update_visibility, to update the visibility of the scatter plot traces based on the values selected from the drop-down menus.
Code
# import the csv
dc_health_df = pd.read_csv('./PLACES__Census_Tract_Data__GIS_Friendly_Format___2022_release (1).csv')

# filter for where StateAbbr = DC
dc_health_df = dc_health_df[dc_health_df['StateAbbr'] == 'DC']

# Resetting defaults and importing plotly libraries
import plotly.io as pio
pio.renderers.default = "browser"
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import statsmodels.api as sm
import numpy as np
from sklearn.metrics import r2_score

# isolate only columns with CrudePrev in the name
dc_health_df_prev = dc_health_df.filter(regex='CrudePrev')
df = dc_health_df_prev

# Rename columns
df = df.rename(columns={
    'ACCESS2_CrudePrev': '% of Adults without Health Insurance',
    'ARTHRITIS_CrudePrev': '% of Adults with Arthritis',
    'BINGE_CrudePrev': '% of Adults who Binge Drink',
    'BPHIGH_CrudePrev': '% of Adults with High Blood Pressure',
    'BPMED_CrudePrev': '% of Adults with High Blood Pressure who take Blood Pressure Medication',
    'CANCER_CrudePrev': '% of Adults who were Diagnosed with Cancer',
    'CASTHMA_CrudePrev': '% of Adults who were Diagnosed with Asthma',
    'CERVICAL_CrudePrev': '% of Women who had a Pap Smear in the Past 3 Years',
    'CHD_CrudePrev': '% of Adults who were Diagnosed with Coronary Heart Disease',
    'CHECKUP_CrudePrev': '% of Adults who had a Routine Checkup in the Past Year',
    'CHOLSCREEN_CrudePrev': '% of Adults who had Cholesterol Checked in the Past 5 Years',
    'COLON_SCREEN_CrudePrev': '% of Adults who had a Colonoscopy or similar test in the Past 10 Years',
    'COPD_CrudePrev': '% of Adults who were Diagnosed with COPD (Chronic Obstructive Pulmonary Disease)',
    'COREM_CrudePrev': '% Prevalence of Older Adult Men aged >=65 years who are up to date on preventative health',
    'COREW_CrudePrev': '% Prevalence of Older Adult Women aged >=65 years who are up to date on preventative health',
    'CSMOKING_CrudePrev': '% of Adults who Currently Smoke',
    'DENTAL_CrudePrev': '% of Adults who had a Dental Visit in the Past Year',
    'DEPRESSION_CrudePrev': '% of Adults who were Diagnosed with Depression',
    'DIABETES_CrudePrev': '% of Adults who were Diagnosed with Diabetes',
    'GHLTH_CrudePrev': '% of Adults who reported their Health as not Good',
    'HIGHCHOL_CrudePrev': '% of Adults who were Diagnosed with High Cholesterol',
    'KIDNEY_CrudePrev': '% of Adults who were Diagnosed with Kidney Disease',
    'LPA_CrudePrev': '% of Adults who are Physically Inactive',
    'MAMMOUSE_CrudePrev': '% Women aged 50-74 years who had a Mammogram in the Past 2 Years',
    'MHLTH_CrudePrev': '% of Adults who reported their Mental Health as not Good',
    'OBESITY_CrudePrev': '% of Adults who were Obese',
    'PHLTH_CrudePrev': '% of Adults who reported their Physical Health as not Good',
    'SLEEP_CrudePrev': '% of Adults who reported their Sleep as not Good',
    'STROKE_CrudePrev': '% of Adults who were Diagnosed with Stroke',
    'TEETHLOST_CrudePrev': '% of Adults who have lost all of their Natural Teeth'})

# list of health metrics for drop down menu
column_names = df.columns

# Creating the initial scatter plot
fig = go.Figure(go.Scatter(x=df[column_names[0]], y=df[column_names[1]], mode='markers'))

# Label axes
fig.update_xaxes(title_text='X Axis')
fig.update_yaxes(title_text='Y Axis')

# Setting the range for x and y axes
fig.update_xaxes(range=[0, 100])
fig.update_yaxes(range=[0, 100])

# add one hidden trace per pair of metrics
for col in column_names:
    for col2 in column_names:
        x = df[col]
        y = df[col2]
        fig.add_trace(go.Scatter(x=x, y=y, mode='markers',
                                 name=col + ' vs ' + col2,
                                 showlegend=False, visible=False))

# Update the visibility of the traces
def update_visibility(selected_col, selected_col2):
    for i, trace in enumerate(fig.data):
        if trace.name == selected_col + ' vs ' + selected_col2:
            trace.visible = True
        elif trace.name == selected_col + ' vs ' + selected_col2 + ' Best Fit':
            trace.visible = True
        else:
            trace.visible = False

# Create the drop-down menus for x (col) and y (col2) axes of the scatter plot
col_dropdown = [{'label': col, 'value': col} for col in column_names]
col2_dropdown = [{'label': col2, 'value': col2} for col2 in column_names]

# Define the dropdown menu for x-axis
button_layer_1_height = 1.08
x_axis_dropdown = go.layout.Updatemenu(
    buttons=list([dict(args=[{'x': [df[col]]}, update_visibility(col, col2)],
                       label=col,
                       method='update') for col in column_names]),
    direction="down",
    pad={"r": 10, "t": 10},
    showactive=True,
    x=0.06, xanchor="left",
    y=button_layer_1_height + 0.05, yanchor="top")

# Define the dropdown menu for y-axis
y_axis_dropdown = go.layout.Updatemenu(
    buttons=list([dict(args=[{'y': [df[col2]]}, update_visibility(col, col2)],
                       label=col2,
                       method='update') for col2 in column_names]),
    direction="down",
    pad={"r": 10, "t": 10},
    showactive=True,
    x=0.06, xanchor="left",
    y=button_layer_1_height, yanchor="top")

# Update the layout to include the dropdown menus
fig.update_layout(updatemenus=[x_axis_dropdown, y_axis_dropdown])

# Label axes
fig.update_xaxes(title_text='X Axis')
fig.update_yaxes(title_text='Y Axis')

# Setting the range for x and y axes
fig.update_xaxes(range=[0, 100])
fig.update_yaxes(range=[0, 100])

# Update plot sizing
fig.update_layout(
    width=900,
    height=900,
    autosize=False,
    # margin=dict(t=100, b=0, l=0, r=0),
)

# add annotations
fig.update_layout(
    annotations=[
        dict(text="X Axis:", x=0, xref="paper", y=button_layer_1_height + 0.025,
             yref="paper", align="left", showarrow=False),
        dict(text="Y Axis:", x=0, xref="paper", y=button_layer_1_height - 0.025,
             yref="paper", align="left", showarrow=False)
    ])

# Change background color to grey
fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')

# Change scatter point color to red
fig.update_traces(marker=dict(color='red'))

# Change font to Proxima Nova
fig.update_layout(font=dict(family='Proxima Nova'))

# Display the scatter plot with dropdown menus
fig.show()
2.3 Data Visualization
Building on this, we created a second graph exploring the relationship between walkability and health outcomes. We created two drop-down menus using dcc.Dropdown from Dash, Plotly's web-app framework. The selected values from the drop-down menus are passed to the update_visibility function, which updates the scatter plot based on the selected health metrics. We then displayed the final plot using dcc.Graph.
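The code for this second graph is not reproduced here, so the following is a minimal sketch of the Dash pattern just described; the file name walkability_health.csv and the column names are illustrative placeholders, not the ones used in our notebook.

from dash import Dash, dcc, html, Input, Output
import plotly.express as px
import pandas as pd

# hypothetical cleaned file with one row per census tract
df = pd.read_csv('walkability_health.csv')
metrics = [c for c in df.columns if c != 'walkability_score']  # health outcome columns

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id='metric', options=metrics, value=metrics[0]),
    dcc.Graph(id='scatter'),
])

@app.callback(Output('scatter', 'figure'), Input('metric', 'value'))
def update_visibility(metric):
    # redraw the scatter of walkability vs. the selected health metric
    return px.scatter(df, x='walkability_score', y=metric)

if __name__ == '__main__':
    app.run(debug=True)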
3. How accessible are neighborhoods in Washington, D.C. by bike?
3.1 Data Collection
Bike lanes and the Capital Bikeshare program are two major initiatives by the D.C. Department of Transportation for improving quality of life and transportation access in D.C. Since these two initiatives go hand in hand, we decided to analyze them together in a single visualization. For this analysis we collected the following datasets:
Capital Bikeshare Trip Data for March 2023: This dataset compiles all of the Capital Bikeshare trips completed by riders in the Washington, D.C. metro area. It contains fields such as started_at and ended_at (each trip's start and end times) and start_station_name and end_station_name (street name information for the starting and ending dock stations). It also includes the latitude and longitude of the start and end locations of each rider's trip.
U.S. Census Tract D.C. GeoJSON: This GeoJSON map file, created by the U.S. Census Bureau for Washington, D.C., includes geographical information regarding the location of each neighborhood in Washington, D.C.
Open Data DC Bike Lane GeoJSON: This GeoJSON file, published by the D.C. Government, provides geographical information regarding the various bike lanes located throughout the city.
3.2 Data Cleaning
Since the goal of this visualization was to show both the bike lane and bikeshare data on a single map, we cleaned the datasets to put them in the proper format. In the Capital Bikeshare dataset, each trip's latitude and longitude were recorded from the point where the bike was activated at the dock. This meant that even when rides started at the same bike dock, there were small variations in the recorded latitudes and longitudes. Therefore, the latitude and longitude values were standardized for each bike station. Additionally, the bikeshare dataset covered the entire Washington, D.C. metro area; since we focused our analysis on D.C. only, stations outside the D.C. boundary were removed. Finally, any rows with incomplete trip values were removed.
The GeoJSON files did not undergo any processing and therefore were used as-is.
Code
import pandas as pd
import numpy as np
import altair as alt
import plotly.graph_objects as go
from vega_datasets import data
import requests
import json
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')

# Read in data
data_dir = Path().absolute().parent.absolute().parent/"data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"
bikeshare_df = pd.read_csv(data_dir/"202303-capitalbikeshare-tripdata.csv")

# Convert dates into datetime format
bikeshare_df['started_at'] = pd.to_datetime(bikeshare_df['started_at'])
bikeshare_df['ended_at'] = pd.to_datetime(bikeshare_df['ended_at'])

# Drop rides with NaN values
bikeshare_df.dropna(subset=['start_station_name'], inplace=True)
bikeshare_df.dropna(subset=['end_station_name'], inplace=True)

# Standardize longitude and latitude using start station
bikeshare_df['start_lng'] = bikeshare_df['start_lng'].groupby(bikeshare_df['start_station_id']).transform('max')
bikeshare_df['start_lat'] = bikeshare_df['start_lat'].groupby(bikeshare_df['start_station_id']).transform('max')

# Create dataframe for joining
tmp = bikeshare_df[['start_station_id', 'start_lng', 'start_lat']]
tmp.drop_duplicates(inplace=True)

# Merge using the common station id value
bikeshare_df = bikeshare_df.merge(tmp, left_on='end_station_id', right_on='start_station_id')

# Drop repeated columns and rename them
bikeshare_df.drop(columns=['end_lat', 'end_lng', 'start_station_id_y'], inplace=True)
bikeshare_df.rename(columns={'start_lat_x': 'start_lat', 'start_lng_x': 'start_lng',
                             'start_lat_y': 'end_lat', 'start_lng_y': 'end_lng',
                             'start_station_id_x': 'start_station_id'}, inplace=True)

bikeshare_df.to_csv(Path().absolute().parent.absolute().parent/"data/cleaned_data/bikeshare_cleaned.csv", index=False)
3.3 Data Visualization
In the Altair plot, we combined March 2023 Capital Bikeshare data with D.C. geographical data to create a layered visual. The geographical layout is a map of D.C. with each neighborhood outlined in white. Each point represents a Capital Bikeshare station, with its size representing the number of trips starting from that station in the month of March. The yellow lines represent the bike lanes created by the D.C. government on public streets. When the user hovers over any station, the visual shows black network lines (or edges) that connect that station (the "start" station) to other stations ("end" stations), representing where people have traveled on Capital Bikeshare bikes. We also included tooltips that display station information when a user hovers over a station.
Code
import pandas as pd
import numpy as np
import altair as alt
import plotly.graph_objects as go
from vega_datasets import data
import requests
import json
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')

data_dir = Path().absolute().parent.absolute().parent/"data"
bikeshare_df = pd.read_csv(Path().absolute().parent.absolute().parent/"data/cleaned_data/bikeshare_cleaned.csv")

# Create list of bikeshare stations outside of DC
nondc_stations = [
    32256,32251,32237,32241,32210,32225,32259,32223,32209,32240,32239,32245,32220,32214,32219,
    32224,32217,32213,32239,32246,32247,32250,32248,32246,32228,32215,32238,32252,32249,32260,
    32234,32231,32235,32255,32200,32208,32201,32211,32227,32207,32229,32221,32206,32233,32205,
    32204,32205,32203,32206,32222,32230,32232,32600,32602,32603,32608,32605,32604,32607,32609,
    31948,31904,32606,32601,31921,31905,31902,31901,31976,31036,31977,31900,31920,31049,31037,
    31926,31919,31035,31973,31069,31023,31022,31021,31019,31020,31094,31092,31079,31030,31029,
    31080,31093,31014,31062,31077,31073,31024,31040,31028,31017,31924,31027,31947,31066,31075,
    31949,31053,31971,31067,31058,31923,31063,31068,31951,31945,31095,31006,31005,31091,31004,
    31936,31071,31090,31950,31064,31935,31011,31012,31009,31944,31052,31010,31959,31916,31088,
    31960,31956,31910,31083,31915,31087,31085,31913,31915,31970,31969,31906,31098,31048,31081,
    31084,31082,31974,31930,31932,31953,31942,31967,32406,32423,32415,32407,32405,32401,32400,
    32405,32404,32413,32418,32410,32403,32408,32421,32402,32417,32422,32420,32414,32412,32416,
    32059,32061,32026,32011,32049,32082,32058,32025,32001,32058,32082,32024,32043,32036,32012,
    32034,32035,32050,32056,32426,32425,32424,32426,32085,32094,32089,32093,32091,32090,32087,
    32088,32086,32092,32022,32066,32064,32062,32065,32073,32063,32084,32054,32051,32040,32046,
    32029,32055,32002,32021,32003,32048,32013,32000,32008,32028,32027,32053,32039,32057,32078,
    32075,32077,32076,32079,32080,32074,32081,32032,32047,32044,32017,32007,32009,32023,32033,
    32016,32004,32005,32072,32041,32052,32071,32038,32037,32045,32067,32069,32068,32018,32253,
    32236,32243,32258,32216,32212,32218,32019,32411,31929,31914,31907,31903,31958,31933,31041,
    31042,31968,31044,31045,31955,31046,31047,31099,31043,31097,31931,31918,31086,31927,31966,
    21943,31963,31952,31964,31962,31908,31072,31941,31961,31928,31054,31033,31059,31057,31061,
    31056,31055,31909,31912,31065,31032,31074,31078,32419,31957,31954,31946,31972,31060,31938,
    31013,31002,31007,31000,31003,31096,31070,31039,31034,31025,31038,31026,31050,31940,31089,
    31031,31051,31937,31016,31018,31039,31015,31917,31076,31939,32409]

alt.data_transformers.enable('default', max_rows=None)

#### BACKGROUND FOR DC MAP
# Define background of Washington D.C.
response1 = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')
background = alt.Chart(
    alt.Data(values=response1.json()),
    title="Map of D.C. Bike Lanes, Capital Bikeshare Stations, & Routes in March 2023"
).mark_geoshape(
    fill="lightgray", stroke='white', strokeWidth=1
).properties(width=600, height=600)

#### BACKGROUND FOR DC BIKE LANE LOCATIONS
# Open GeoJSON file for bicycle lanes
with open(data_dir/'Bicycle_Lanes.geojson') as f:
    data = json.load(f)

# Create background of D.C.
background_lanes = alt.Chart(alt.Data(values=data)).mark_geoshape(
    stroke='#d6a320', strokeWidth=1
).properties(width=600, height=600)

#### MOUSEOVER SELECTION
# Create mouseover selection
select_station = alt.selection_single(
    on="mouseover", nearest=True, fields=["start_station_name"], empty='none')

#### NETWORK CONNECTIONS FOR MAP
# Filter non-DC stations
tmp1 = bikeshare_df[~bikeshare_df['start_station_id'].isin(nondc_stations)]
tmp1 = tmp1[~tmp1['end_station_id'].isin(nondc_stations)]

# Keep only relevant columns and drop duplicates to have one row per route
tmp1 = tmp1[['start_station_name', 'start_station_id', 'end_station_name', 'end_station_id',
             'start_lat', 'start_lng', 'end_lat', 'end_lng']].drop_duplicates()

# Define connections
connections = alt.Chart(tmp1).mark_rule(opacity=0.35).encode(
    latitude="start_lat:Q",
    longitude="start_lng:Q",
    latitude2="end_lat:Q",
    longitude2="end_lng:Q"
).transform_filter(select_station)

#### POINTS FOR MAP
# Filter non-DC stations
tmp2 = bikeshare_df[~bikeshare_df['start_station_id'].isin(nondc_stations)]
tmp2 = tmp2[~tmp2['end_station_id'].isin(nondc_stations)]

# Create boolean columns for rideable type and member type
tmp2['classic_bike'] = np.where(tmp2['rideable_type'] == 'classic_bike', 1, 0)
tmp2['electric_bike'] = np.where(tmp2['rideable_type'] == 'electric_bike', 1, 0)
tmp2['docked_bike'] = np.where(tmp2['rideable_type'] == 'docked_bike', 1, 0)

# Temporary dataframe showing unique station locations with ride count
tmp2 = tmp2[['start_station_name', 'start_station_id', 'start_lng', 'start_lat', 'ride_id',
             'classic_bike', 'electric_bike', 'docked_bike']].groupby(
    ['start_station_name', 'start_station_id', 'start_lng', 'start_lat']).agg(
    {'ride_id': 'count', 'classic_bike': 'sum', 'electric_bike': 'sum', 'docked_bike': 'sum'}).reset_index()
tmp2.rename(columns={'ride_id': 'count_rides', 'classic_bike': 'count_classic',
                     'electric_bike': 'count_electric', 'docked_bike': 'count_dock'}, inplace=True)
tmp2['color'] = 'Bike Station'

points = alt.Chart(tmp2).mark_circle().encode(
    latitude="start_lat:Q",
    longitude="start_lng:Q",
    color=alt.Color('color:N', title="Legend",
                    scale=alt.Scale(domain=['Bike Station', 'Bike Lane'],
                                    range=['#962e2ec8', '#d6a320'])),
    size=alt.Size("count_rides:Q", scale=alt.Scale(range=[15, 250]), legend=None),
    order=alt.Order("count_rides:Q", sort="descending"),
    tooltip=[
        alt.Tooltip('start_station_id:Q', title='\U0001f4cd Start Station ID'),
        alt.Tooltip('start_station_name:N', title='\U0001f6e3 Start Station Name'),
        alt.Tooltip('count_rides:Q', title='\U0001f50d Ride Count'),
        alt.Tooltip('count_classic:Q', title='\U0001f6b4 Classic Bike Count'),
        alt.Tooltip('count_electric:Q', title='\U0001f6b5 Electric Bike Count'),
        alt.Tooltip('count_dock:Q', title='\U0001f6b2 Docked Bike Count')
    ]).add_selection(select_station)

alt.themes.enable('vox')  # add theme

# Show visualization
(background + background_lanes + connections + points).configure_view(stroke=None)
4. What is metro ridership in Washington, D.C. like?
4.1 Data Collection
To obtain information about metro ridership, we turned to WMATA, which has its own ridership data portal. The data comes from WMATA's Metrorail Ridership Year-over-Year Change dashboard. Only data for March 2023 is available, so we used the download button in the bottom right corner of the platform to download this data. For more information about the data collection method, contact planning_ridership@wmata.com.
4.2 Data Cleaning
To answer this question, we built an interactive time series plot of Washington Metropolitan Area Transit Authority (WMATA) Metro entries by selected station and date in March 2023 using the Plotly library. Before plotting, we cleaned the data by renaming the 'Date_This_Year' and 'Entries_This_Year' columns to 'Date' and 'Entries', respectively, for clarity. We also pivoted the dataframe so that the number of entries is shown by station in columns and by date in rows. Then, we converted the index column, which contains the dates, to datetime format and sorted the index from earliest to latest date. Finally, we selected the stations Anacostia, Stadium-Armory, Van Ness-UDC, Shaw-Howard Univ, Gallery Place, and Capitol South from the dataframe: the first three stations serve three of the least walkable neighborhoods in Washington, D.C., and the last three serve three of the most walkable. Had we included all stations' data, the plot would have been too crowded and confusing. We saved the cleaned data to a CSV file.
Code
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = data_dir/"cleaned_data"

# Read in data
df = pd.read_csv(data_dir/'wmata.csv', encoding='utf-16', delimiter="\t")

# Remove unneeded columns
df = df.drop(['Day_of_Date_This_Year', 'Servicetype_This_Year_(group)', 'Holiday_Last_Year',
              'Holiday_This_Year', 'Servicetype_This_Year', 'Time_Period', 'Date_Last_Year',
              'Entries_Last_Year'], axis=1)

# Rename columns
df = df.rename(columns={'Date_This_Year': 'Date'})
df = df.rename(columns={'Entries_This_Year': 'Entries'})

# Pivot data
pivot_df = df.pivot_table(index='Date', columns='Station', values='Entries')

# Convert index of pivot_df to datetime
pivot_df.index = pd.to_datetime(pivot_df.index)

# Organize index of pivot_df from earliest to latest date
pivot_df = pivot_df.sort_index()

# Select stations Anacostia, Stadium-Armory, Van Ness-UDC, Shaw-Howard U, Gallery Place, and Capitol South
new_df = pivot_df[['Anacostia', 'Stadium-Armory', 'Van Ness-UDC', 'Shaw-Howard U', 'Gallery Place', 'Capitol South']]

# Save pivot_df, new_df, and the long-format df to csv
pivot_df.to_csv(write_data_dir/'wmata_cleaned.csv', index=False)
new_df.to_csv(write_data_dir/'wmata_new_cleaned.csv', index=False)
df.to_csv(write_data_dir/'wmata_long_cleaned.csv', index=False)
4.3 Data Visualization
Using the cleaned data, we built an interactive time series plot using the plotly.graph_objects and plotly.subplots libraries. We read in the cleaned data (wmata_new_cleaned.csv) and created a list of all the unique station names in the dataset. We created a subplot with one trace per station, using a for loop to iterate through each station and add its trace to the subplot. Then, we created a dropdown menu that lets users choose a station from the list of available stations, with each station represented by a button. When a user selects a station's button, the update method toggles trace visibility so that only the entries by day at the selected station are shown. The layout of the plot is updated to include the dropdown menu and to set the title, axis labels, and fixed axis ranges. The fig.show() method displays the plot.
Code
# Build interactive timeseries plot using plotly
# Import libraries
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib as mpl

mpl.rc_params_from_file('style_sheet.mplstyle')  # load custom style sheet

# List all stations
stations = list(new_df.columns)

# Create subplot with one trace per station
fig = make_subplots(rows=1, cols=1)
for station in stations:
    fig.add_trace(
        go.Scatter(x=new_df.index, y=new_df[station], name=station),
        row=1, col=1
    )

# Create dropdown menu to select station
buttons = []
for station in stations:
    buttons.append(dict(method='update',
                        label=station,
                        args=[{'visible': [station == s for s in stations]}]))

dropdown = dict(
    active=0,
    buttons=buttons,
    direction='down',
    showactive=True,
    x=1.1,
    y=1.1)

# Update layout
fig.update_layout(
    updatemenus=[dropdown],
    height=600,
    width=900,
    title='WMATA Metro Entries by Selected Station and Date in March 2023',
    xaxis_title='Date',
    yaxis_title='Entries',
    yaxis=dict(range=[0, 3000]))

# Show plot
fig.show()

#### Table
# Import the long-format data from the file
df = pd.read_csv(write_data_dir/'wmata_long_cleaned.csv')
df.Date = pd.to_datetime(df.Date)
weekly_df = df.groupby([pd.Grouper(key='Date', freq='W'), 'Station'])['Entries'].sum().reset_index()

# sort so that the station with the most ridership per week is on top
weekly_df = weekly_df.sort_values(by=['Date', 'Entries'], ascending=[True, False])

# format the dates so they look cleaner in the table
weekly_df.Date = weekly_df.Date.dt.strftime('%B %-d')

# format number of rides
weekly_df['Entries'] = weekly_df['Entries'].apply(lambda x: "{:,}".format(int(x)))

headerColor = '#962E2E'   # red part of theme
rowEvenColor = '#FDE6AB'  # lighter yellow part of theme
rowEvenColor2 = 'lightgray'
rowOddColor = 'white'
num_stations = weekly_df.Station.nunique()
num_days = weekly_df.Date.nunique()

table_fig = go.Figure(data=[go.Table(
    header=dict(
        values=['Start of Week', 'Metro Station', 'Num. of Riders over 7 Days'],
        line_color='darkslategray',
        fill_color=headerColor,
        align=['center'],
        font=dict(color='white', size=12)
    ),
    cells=dict(
        values=[weekly_df.Date, weekly_df.Station, weekly_df.Entries],
        line_color='darkslategray',
        # 2-D list of colors for alternating rows
        fill_color=[[rowOddColor, rowEvenColor]*(int(len(df)/2)+len(df) % 2)],
        # fill_color = [([rowOddColor]*num_stations+[rowEvenColor]*num_stations)*(int(num_days/2)+(num_days%2))],
        # fill_color = [([rowOddColor,rowEvenColor]*(int(num_stations/2)+(num_stations%2))+[rowOddColor,rowEvenColor2]*(int(num_stations/2)+(num_stations%2)))*(int(len(df)/(num_stations*2)) + len(df)%(num_stations*2))],
        align=['center'],
        font=dict(color='darkslategray', size=11)
    ))])
table_fig.update_layout(title_text='Weekly D.C. Metro Ridership')
table_fig.update_layout({'margin': {'t': 50}})
table_fig.show()
5. What is public sentiment around walkability in Washington, D.C.?
5.1 Data Collection
To collect public sentiment regarding different transportation methods, we used the Reddit API to scrape replies to posts found using keywords such as "Walkability", "Cars", and "Bicycles". The API collects the comments from each thread URL we scrape, and a dataframe was finally generated with the content of each post.
Code
import praw
import pandas as pd

# Creating connection to API
reddit_read_only = praw.Reddit(client_id="",
                               client_secret="",
                               user_agent="Walkability_DC")

subreddit = reddit_read_only.subreddit("redditdev")
print("Name:", subreddit.display_name)
print("Title:", subreddit.title)
print("Description:", subreddit.description)

# Gather urls of predefined topics
# (bikes_urls, cars_urls, and walk_urls are lists of thread URLs defined elsewhere)
topics = {"bikes": [bikes_urls],
          "cars": [cars_urls],
          "walk": [walk_urls]}

# Scraping data by url
for key in topics.keys():
    comments = []
    for url in topics[key][0]:
        try:
            submission = reddit_read_only.submission(url=url)
        except:
            print("incorrect url:", url)
            continue
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            comments.append(comment.body)
    topics[key].append(comments)

# Generation of dataframe with all threads per category
df_reddit = pd.DataFrame(data=[topics["bikes"][1], topics["cars"][1], topics["walk"][1]]).T
df_reddit = df_reddit.rename(columns={0: "bikes", 1: "cars", 2: "walk"})
df_reddit.to_csv("../data/reddit.csv", index=False)
5.2 Data Cleaning
A natural language analysis was then performed, starting by removing non-letter characters and lowercasing the text.
Code
import pandas as pd
import re
import random
import numpy as np
import torch as torch
import tensorflow as tf
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from pysentimiento import create_analyzer
import json

# opening previously obtained data
df = pd.read_csv('../data/reddit.csv')
df = df.fillna("")
df.head()

# function to clean strings
def remove_nonletters(strings):
    pattern = re.compile(r"[^a-zA-Z\s]|[\n]+")
    cleaned_strings = [(pattern.sub("", i)).lower() for i in strings]
    return cleaned_strings

for col in df.columns:
    col_clean = str(col) + "_clean"
    df[col_clean] = remove_nonletters(df[col])
After cleaning the characters, we removed stopwords and tokenized the texts to generate word clouds per category.
Code
# Cleaning stopwords
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import re

s_words = list(stopwords.words('english'))
extra_s_words = ["the", "to", "a", "and", "of", "i", "is", "you", "it", "i", "is", "ive",
                 "youve", "im", "dc", "youre", "dont", "would", "also", "might"]
for i in extra_s_words:
    s_words.append(i)

pattern = re.compile(r'\s*\b(' + '|'.join(s_words) + r')\b\s*', re.IGNORECASE)

x = ["bikes_clean", "cars_clean", "walk_clean"]
words = {}
for col in x:
    tokens = [word_tokenize(pattern.sub(' ', i)) for i in df[col]]
    words[col] = " ".join([word for sentence in tokens for word in sentence])
words

# Obtaining the frequency of each word
fdists = {}
for key in words.keys():
    tokenized_words = nltk.tokenize.word_tokenize(words[key])
    fdist = FreqDist(tokenized_words)
    fdists[key] = fdist
fdists

# obtaining the top frequent values and generating the string to be consumed by the wordcloud script
words_freq = []
for key_n in fdists.keys():
    fdist = (dict(fdists[key_n].most_common(30)))
    result = ""
    total = sum(fdist.values())
    for key, value in fdist.items():
        n = (value * 100) // total
        result += (key + " ") * n
    words_freq.append(result)
words_freq
Furthermore, to obtain the sentiment of each opinion, we used a pre-trained BERT model (via pysentimiento) to predict negative, neutral, and positive sentiment.
Code
# Calling a BERT transformer
# (the scraped comments are in English, so we use the English sentiment model)
analyzer = create_analyzer(task="sentiment", lang="en")

# predicting sentiment using analyzer
for i in ["walk", "bikes", "cars"]:
    file_name = "../data/" + i + ".txt"
    df = pd.read_csv(file_name)
    df = df.dropna()
    col = i + "_clean"
    sentiment = [analyzer.predict(text) for text in df[col]]
    sent_label = [s.output for s in sentiment]
    sent_prob = [s.probas for s in sentiment]
    df_probs = pd.DataFrame(sent_prob)
    df_probs["label"] = sent_label
    df_probs[i] = df[col].to_list()  # align cleaned text with its predicted sentiment
    file_name = "../data/" + i + ".csv"
    df_probs.to_csv(file_name)

# opening previously modified sentiment analysis
b_sent = pd.read_csv("../data/bikes_sent.csv")
c_sent = pd.read_csv("../data/cars_sent.csv")
w_sent = pd.read_csv("../data/walk_sent.csv")

# drop index
b_sent = b_sent.drop(columns=['Unnamed: 0'])
c_sent = c_sent.drop(columns=['Unnamed: 0'])
w_sent = w_sent.drop(columns=['Unnamed: 0'])

# add category
b_sent["category"] = "bike"
c_sent["category"] = "car"
w_sent["category"] = "walk"

b_sent = b_sent.rename(columns={"bikes": "text"})
c_sent = c_sent.rename(columns={"cars": "text"})
w_sent = w_sent.rename(columns={"walk": "text"})

# generate a single dataframe
df_sent = pd.concat([b_sent, c_sent, w_sent], ignore_index=True)
df_sent.shape
Finally, to better understand the sentiment, we analyzed the bigrams of words in each category to understand the words behind the opinions.
Code
import nltk
from functools import reduce

# Obtaining bigrams and generating a dictionary with their frequency
bigram_freqs = []
j = 0
categories = [words[cat] for cat in words.keys()]
flattened_text = " ".join([text for text in [words['bikes_clean'], words['cars_clean'], words['walk_clean']]])
categories.append(flattened_text)

for cat in categories:
    # obtaining bigrams per category
    x = pattern.sub(' ', cat)
    x = x.replace("  ", " ").replace("  ", " ")
    tokens = nltk.word_tokenize(x)
    bigrams = list(nltk.bigrams(tokens))
    y = {i: 0 for i in set(bigrams)}
    for bigram in bigrams:
        y[bigram] += 1
    bigram_freqs.append(dict(y))

"""
To observe the bigrams on a network diagram we want to store the values and their
frequency in a JSON file with the following syntax:
data = {"nodes": [{"name": "A Chifolleau", "n": 1, "grp": 1, "id": "A Chifolleau"}, ...],
        "links": [{"source": "A Besnard", "target": "M Ardisson", "value": 1},
                  {"source": "A Besnard", "target": "Y Holtz", "value": 1}, ...]}
"""
def transform_dict(input_dict):
    unique_words = set()
    for key in input_dict.keys():
        unique_words.add(key[0])
        unique_words.add(key[1])
    word_to_id = {word: i + 1 for i, word in enumerate(sorted(unique_words))}
    # nodes = [{'name': word, 'value': word_to_id[word], 'colour': '#bde0f6'} for word in sorted(unique_words)]
    nodes = [{'name': word, 'n': word_to_id[word], "grp": 1, "id": word} for word in sorted(unique_words)]
    links = []
    for key, value in input_dict.items():
        source = word_to_id[key[0]]
        target = word_to_id[key[1]]
        links.append({'source': key[0], 'target': key[1], 'value': value})
    return {'nodes': nodes, 'links': links}

# Function to reduce the dictionary to contain only values with 3 connections or more
def filter_dict(d):
    return {k: v for k, v in d.items() if v >= 3}

filtered_bigrams = [filter_dict(i) for i in bigram_freqs]
networks = [transform_dict(i) for i in filtered_bigrams]

# storing all dictionaries in json files
cat = ["bikes", "cars", "walk", "all"]
for i in range(0, len(cat)):
    filename = "../data/bigrams_" + cat[i] + ".json"
    with open(filename, 'w') as fp:
        json.dump(networks[i], fp, indent=4)
5.3 Data Visualization
We generated visualizations using the Plotly library in Python to analyze the sentiment polarity of comments related to the different categories (bike, walk, and car). We first load and clean the data from three CSV files (b_sent.csv, c_sent.csv, and w_sent.csv), which contain the sentiment analysis results for Reddit comments related to bikes, cars, and walking, respectively. Next, we create three box plots using Plotly, one for each category of transport, with the sentiment polarity values on the y-axis and the categories on the x-axis. After the first graph, we create a 2D histogram contour plot and two more histograms. The contour plot (trace2) shows the density of sentiment polarity values for each category, with darker areas indicating higher density. The two histograms (trace3 and trace4) show the distribution of sentiment polarity values along the x-axis and of categories along the y-axis. The final visualizations provide insight into the sentiment polarity of comments in each category, allowing a comparison of sentiment distribution across categories.
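The box-plot portion of that figure is not reproduced below, so the following is a minimal sketch of the pattern; it assumes the combined df_sent dataframe from Section 5.2 and a hypothetical numeric polarity column (for instance, P(positive) minus P(negative) from the model's probabilities).

import plotly.graph_objects as go

fig = go.Figure()
for cat in ['bike', 'car', 'walk']:
    subset = df_sent[df_sent['category'] == cat]
    fig.add_trace(go.Box(y=subset['polarity'], name=cat))  # one box per transport category

fig.update_layout(xaxis_title='Category', yaxis_title='Sentiment polarity')
fig.show()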
Code
d3 = require("d3@7")
d3Cloud = require("d3-cloud@1")
import {howto} from "@d3/example-components"

function WordCloud(title, text, {
  size = group => group.length, // Given a grouping of words, returns the size factor for that word
  word = d => d,                // Given an item of the data array, returns the word
  marginTop = 0,     // top margin, in pixels
  marginRight = 0,   // right margin, in pixels
  marginBottom = 0,  // bottom margin, in pixels
  marginLeft = 0,    // left margin, in pixels
  width = 640,       // outer width, in pixels
  height = 400,      // outer height, in pixels
  maxWords = 250,    // maximum number of words to extract from the text
  fontFamily = "sans-serif", // font family
  fontScale = 30,    // base font size
  padding = 0,       // amount of padding between the words (in pixels)
  rotate = 0,        // a constant or function to rotate the words
  invalidation       // when this promise resolves, stop the simulation
} = {}) {
  const words = typeof text === "string" ? text.split(/\W+/g) : Array.from(text);

  const data = d3.rollups(words, size, w => w)
    .sort(([, a], [, b]) => d3.descending(a, b))
    .slice(0, maxWords)
    .map(([key, size]) => ({text: word(key), size}));

  const svg = d3.create("svg")
      .attr("viewBox", [0, 0, width, height])
      .attr("width", width)
      .attr("font-family", fontFamily)
      .attr("text-anchor", "middle")
      .attr("fill", "#962e2ec8")
      .attr("style", "max-width: 100%; height: auto; height: intrinsic;")
      .text(title);

  const g = svg.append("g").attr("transform", `translate(${marginLeft},${marginTop})`);

  const cloud = d3Cloud()
      .size([width - marginLeft - marginRight, height - marginTop - marginBottom])
      .words(data)
      .padding(padding)
      .rotate(rotate)
      .font(fontFamily)
      .fontSize(d => Math.sqrt(d.size) * fontScale)
      .on("word", ({size, x, y, rotate, text}) => {
        g.append("text")
            .attr("font-size", size)
            .attr("transform", `translate(${x},${y}) rotate(${rotate})`)
            .text(text);
      });

  cloud.start();
  invalidation && invalidation.then(() => cloud.stop());
  return svg.node();
}

WordCloud("Bicycle", "bike bike bike bike bike bike bike bike bike bike bike bike bike get get get get get get one one one one lanes lanes lanes lanes good good good good trail trail trail trail st st st lane lane lane park park park like like like street street street city city mbt mbt bikes bikes many many ride ride trails trails way way pretty pretty people people th th go go rock rock creek creek nw nw map map take take lock lock want want much much", {
  width: 250,
  height: 100,
  size: () => .3 + Math.random(),
  rotate: () => (~~(Math.random() * 6) - 3) * 30
})