1. How is walkability associated with socioeconomic outcomes in Washington, D.C.?
1.1 Data Collection
To investigate whether walkability has an impact on other aspects of people's lives, we compared socioeconomic outcomes across neighborhoods according to each neighborhood's walkability. To accomplish this, we looked for walkability data and socioeconomic-outcome data with neighborhood-level granularity. We were able to conduct this analysis with the following datasets:
EPA Walkability Index: This dataset compiles different features about each census tract1 and computes a walkability score for it. Some D.C. neighborhoods comprise several census tracts, depending on the neighborhood's population density.2
U.S. Census Bureau Community Resilience Estimates: This dataset compiles different socioeconomic factors for each census tract. The ones we chose to analyze were the percentage of households that fall below the poverty threshold, high school completion rates, income inequality, and vehicle ownership.
U.S. Census Tract D.C. GeoJSON: This GeoJSON map file, created by the U.S. Census Bureau for Washington, D.C., includes geographical information regarding the location of each neighborhood in Washington, D.C.
1 A census tract is a geographic region defined for the purpose of taking a census. There are 179 census tracts in Washington, D.C.
2 Census tracts generally have a population size between 1,200 and 8,000 people, with an optimum size of 4,000 people. A census tract usually covers a contiguous area; however, the spatial size of census tracts varies widely depending on the density of settlement.
1.2 Data Cleaning
To create a clean dataset in the format we wanted, we wrote a script that subset the raw data to D.C. in each data source and kept only the columns of interest. We then joined the files on the census tract ID column. Most of the columns were percentages, but some, like the walkability score, were reported on a different numeric range. To ensure that the visualizations would enable easy comparison between the factors, each outcome column was rescaled to a 0-100 range. We then conducted some basic data cleaning, including converting missing or negative values to NaN, renaming columns, and melting the data into long format. The JSON data with the geographic boundaries for each D.C. census tract did not require any cleaning.
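The rescaling is a standard min-max transformation. As a minimal illustrative sketch, separate from the full cleaning script below: a walkability score of 10.5 on the original 1-20 scale maps to 50.0 on the new scale.

def rescale(x, old_min=1, old_max=20, new_min=0, new_max=100):
    # min-max rescaling applied to each outcome column
    return (x - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

rescale(10.5)  # -> 50.0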
Code
from IPython.display import display, HTML
import altair as alt
import pandas as pd
import geopandas as gpd
from pathlib import Path
import requests
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

"""IMPORT DATA"""
# define data directory
data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = data_dir/"cleaned_data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"

# import data
walkability = pd.read_csv(data_dir/"joined_depression_cre_walkability.csv")
walkability.loc[:, 'geoid_tract_20'] = walkability.geoid_tract_20.astype(str)
nation = pd.read_csv(data_dir/"cleaned_data"/"nation-joined_depression_cre_walkability.csv")

# # Ingest GEOJSON file of census tracts in DC and grab json
# req_dc = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')
# json_dc = req_dc.json()
# # create geopandas dataframe and add in the walkability / outcomes data
# geo_df = gpd.GeoDataFrame.from_features((json_dc))
# merged_df = geo_df.merge(walkability,
#                          how='left',
#                          left_on='GEOID',
#                          right_on='geoid_tract_20')
merged_df = walkability.copy(deep=True)

"""NORMALIZE SCORES ACROSS ALL METRICS"""
# convert the walkability score into a scale from 0 to 100 to make it easier to interpret
# original range: 1-20
# new desired range: 0-100
original_range_min = 1
original_range_max = 20
new_range_max = 100
new_range_min = 0
merged_df.loc[:, 'walkability_score_scaled'] = merged_df.loc[:, 'walkability_score'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)
nation.loc[:, 'walkability_score_scaled'] = nation.loc[:, 'walkability_score'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)

# convert the income inequality index score into a scale from 0 to 100 to make it easier to interpret
# original range: 0-1
# new desired range: 0-100
original_range_min = 0
original_range_max = 1
new_range_max = 100
new_range_min = 0
merged_df.loc[:, 'income_inequality_gini_index'] = merged_df.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: x if x >= 0 else np.nan)
merged_df.loc[:, 'income_inequality_gini_index_scaled'] = merged_df.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)
nation.loc[:, 'income_inequality_gini_index'] = nation.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: x if x >= 0 else np.nan)
nation.loc[:, 'income_inequality_gini_index_scaled'] = nation.loc[:, 'income_inequality_gini_index'].apply(
    lambda x: ((x - original_range_min) / (original_range_max - original_range_min)) * (new_range_max - new_range_min) + new_range_min)

# define columns to report
outcomes_cols = ['walkability_score_scaled',
                 'below_poverty_level_perc',
                 'income_inequality_gini_index_scaled',
                 'hs_grad_perc',
                 'households_no_vehicle_perc']
# treat negative values as missing
for i in outcomes_cols:
    merged_df[i] = merged_df[i].apply(lambda x: x if x >= 0 else np.nan)
    nation[i] = nation[i].apply(lambda x: x if x >= 0 else np.nan)

# flip metric to be percent of households with a car
merged_df.loc[:, 'households_w_vehicle'] = 100 - merged_df['households_no_vehicle_perc']
nation.loc[:, 'households_w_vehicle'] = 100 - nation['households_no_vehicle_perc']

"""CLEAN COLUMN NAMES"""
col_mapping = {'below_poverty_level_perc': '% Below Poverty Level',
               'income_inequality_gini_index_scaled': 'Income Inequality Gini Score',
               'hs_grad_perc': '% HS or Higher Degree',
               'households_w_vehicle': '% with a Vehicle',
               'walkability_score_scaled': 'Walkability Score',
               'neighborhood_name': 'Neighborhood Name'}
merged_df = merged_df.rename(col_mapping, axis='columns')

"""RE-FORMAT DATA"""
# turn the dataframe into long data so that the bar chart can be created with each outcome as a bar
neighborhood_df = pd.melt(merged_df, id_vars='Neighborhood Name', value_vars=col_mapping.values())
neighborhood_df = neighborhood_df.groupby(['Neighborhood Name', 'variable'])['value'].mean().reset_index()
walk_scores = dict(zip(
    list(neighborhood_df[neighborhood_df.variable == 'Walkability Score']['Neighborhood Name']),
    list(neighborhood_df[neighborhood_df.variable == 'Walkability Score']['value'])))
neighborhood_df.loc[:, 'Walkability Score'] = neighborhood_df['Neighborhood Name'].map(walk_scores)

# reformat to get the national averages
nation = nation[outcomes_cols + ['households_w_vehicle']]
nation.drop('households_no_vehicle_perc', axis='columns', inplace=True)
nation_avg = pd.melt(nation, value_vars=[i for i in col_mapping.keys() if 'neighborhood_name' not in i])
nation_avg = nation_avg.groupby('variable')['value'].mean().reset_index()

# create cleaned column for plotting the national averages
nation_avg['National Average'] = nation_avg['variable'].map(col_mapping)

# create DC average walkability score
neighborhood_df['dc_avg_walk'] = merged_df['Walkability Score'].mean()

# add URL to the american flag icon
nation_avg['flag_url'] = 'https://upload.wikimedia.org/wikipedia/commons/d/de/Flag_of_the_United_States.png'

# write data to CSV
neighborhood_df.to_csv(write_data_dir/"neighborhood_walkability.csv", index=False)
nation_avg.to_csv(write_data_dir/"national_walkability.csv", index=False)
merged_df.to_csv(write_data_dir/"cleaned_walkability.csv", index=False)
1.3 Data Visualization
Using the Pandas, GeoPandas, and Altair libraries, we generated an interactive visualization of walkability and socioeconomic outcomes in Washington, D.C. We used Altair to create a choropleth map of walkability by census tract alongside a bar chart with outcome names on the x-axis and their mean values as the bar heights; clicking a tract on the map filters the bar chart to that neighborhood. We also aggregated the data at the national level by averaging the variable values across all tracts and stored the result in a separate dataframe, so that the national averages can be overlaid on the bars.
Code
import matplotlib as mpl

# define data directories
data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = data_dir/"cleaned_data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"

# read in cleaned data
neighborhood_df = pd.read_csv(write_data_dir/"neighborhood_walkability.csv")
nation_avg = pd.read_csv(write_data_dir/"national_walkability.csv")
merged_df = pd.read_csv(write_data_dir/"cleaned_walkability.csv")

# Ingest GEOJSON file of census tracts in DC and grab json
req_dc = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')
json_dc = req_dc.json()

# create geopandas dataframe and add in the walkability / outcomes data
geo_df = gpd.GeoDataFrame.from_features((json_dc))
merged_df.loc[:, 'geoid_tract_20'] = merged_df.geoid_tract_20.astype(str)
merged_df = geo_df.merge(merged_df, how='left', left_on='GEOID', right_on='geoid_tract_20')

"""CREATE VISUALIZATION"""
mpl.rc_params_from_file('style_sheet.mplstyle')  # load custom style sheet

# define a click on the choropleth map so that it can filter the bar chart
click = alt.selection_multi(fields=['Neighborhood Name'])

# create the choropleth map
choropleth = (alt.Chart(merged_df, title="Walkability of DC Census Tracts")
              .mark_geoshape(stroke='white')
              .transform_lookup(
                  lookup='geoid_tract_20',
                  from_=alt.LookupData(merged_df, 'geoid_tract_20',
                                       ['Walkability Score', 'Neighborhood Name']))
              .encode(
                  alt.Color('Walkability Score:Q',
                            scale=alt.Scale(scheme='redyellowblue', reverse=True),
                            title="DC Walkability"),
                  opacity=alt.condition(click, alt.value(1), alt.value(0.2)),
                  tooltip=['Neighborhood Name:N', 'Walkability Score:Q'])
              .add_selection(click))

# create the bar chart, filtered by the clicked neighborhood
bars = (alt.Chart(neighborhood_df, title='Outcomes of DC Neighborhoods')
        .mark_bar()
        .encode(
            x=alt.X('variable:N', axis=alt.Axis(labelAngle=-45)),
            color='mean(Walkability Score):Q',
            y=alt.Y('mean(value):Q', sort='x', scale=alt.Scale(domain=[0, 100])),
            tooltip=['variable:N', 'mean(value):Q'])
        .properties(width=200, height=300)
        .transform_filter(click))

# modify the axes and title labels
bars.encoding.y.title = 'Avg. Value Across All Census Tracts'
bars.encoding.x.title = 'Outcome'

# tick marks for the national average of each outcome
nation_avg_lines = (alt.Chart(nation_avg)
                    .mark_tick(
                        color="black",
                        thickness=3,
                        size=39,  # controls width of tick
                        strokeDash=[1, 2])
                    .encode(
                        x='National Average:N',
                        y='value:Q'))

# american flag icon marking each national average
nation_avg_img = (alt.Chart(nation_avg)
                  .mark_image(width=15, height=15)
                  .encode(
                      x='National Average:N',
                      y='value:Q',
                      url='flag_url',
                      tooltip=['National Average', 'value:Q']))

alt.themes.enable('vox')  # add theme

# plot the two graphs together
alt.hconcat(choropleth, (bars + nation_avg_lines + nation_avg_img))
2. How is walkability associated with health outcomes in Washington, D.C.?
2.1 Data Collection
We obtained the PLACES census tract health estimates data. This dataset contains model-based census tract-level estimates from the PLACES 2022 release. PLACES covers the entire United States at the county, place, census tract, and ZIP Code Tabulation Area levels. The estimates are provided by the CDC's Division of Population Health, Epidemiology and Surveillance Branch. The underlying data sources are the Behavioral Risk Factor Surveillance System (BRFSS) 2020/2019, Census Bureau 2010 population estimates, and American Community Survey (ACS) 2015-2019 estimates. The release's 29 measures were mapped at the census tract level using GIS systems.
2.2 Data Cleaning and Exploration
We first read in the PLACES census tract data and conducted some basic cleaning, such as selecting only data related to Washington, D.C. and renaming columns. Then, for some basic data exploration, we created an exploratory scatter plot using Plotly, in which users can examine the relationship between pairs of selected health outcomes in D.C. on the x-axis and y-axis. We used a callback function, update_visibility, to update the visibility of the scatter plot traces based on the values selected from the drop-down menus.
Code
# import the csv
dc_health_df = pd.read_csv('./PLACES__Census_Tract_Data__GIS_Friendly_Format___2022_release (1).csv')

# filter for where StateAbbr = DC
dc_health_df = dc_health_df[dc_health_df['StateAbbr'] == 'DC']

# Resetting defaults and importing plotly libraries
import plotly.io as pio
pio.renderers.default = "browser"
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"
import statsmodels.api as sm
import numpy as np
from sklearn.metrics import r2_score

# isolate only columns with CrudePrev in the name
dc_health_df_prev = dc_health_df.filter(regex='CrudePrev')
df = dc_health_df_prev

# Rename columns
df = df.rename(columns={
    'ACCESS2_CrudePrev': '% of Adults without Health Insurance',
    'ARTHRITIS_CrudePrev': '% of Adults with Arthritis',
    'BINGE_CrudePrev': '% of Adults who Binge Drink',
    'BPHIGH_CrudePrev': '% of Adults with High Blood Pressure',
    'BPMED_CrudePrev': '% of Adults with High Blood Pressure who take Blood Pressure Medication',
    'CANCER_CrudePrev': '% of Adults who were Diagnosed with Cancer',
    'CASTHMA_CrudePrev': '% of Adults who were Diagnosed with Asthma',
    'CERVICAL_CrudePrev': '% of Women who had a Pap Smear in the Past 3 Years',
    'CHD_CrudePrev': '% of Adults who were Diagnosed with Coronary Heart Disease',
    'CHECKUP_CrudePrev': '% of Adults who had a Routine Checkup in the Past Year',
    'CHOLSCREEN_CrudePrev': '% of Adults who had Cholesterol Checked in the Past 5 Years',
    'COLON_SCREEN_CrudePrev': '% of Adults who had a Colonoscopy or similar test in the Past 10 Years',
    'COPD_CrudePrev': '% of Adults who were Diagnosed with COPD (Chronic Obstructive Pulmonary Disease)',
    'COREM_CrudePrev': '% Prevalence of Older Adult Men aged >=65 years who are up to date on preventative health',
    'COREW_CrudePrev': '% Prevalence of Older Adult Women aged >=65 years who are up to date on preventative health',
    'CSMOKING_CrudePrev': '% of Adults who Currently Smoke',
    'DENTAL_CrudePrev': '% of Adults who had a Dental Visit in the Past Year',
    'DEPRESSION_CrudePrev': '% of Adults who were Diagnosed with Depression',
    'DIABETES_CrudePrev': '% of Adults who were Diagnosed with Diabetes',
    'GHLTH_CrudePrev': '% of Adults who reported their Health as not Good',
    'HIGHCHOL_CrudePrev': '% of Adults who were Diagnosed with High Cholesterol',
    'KIDNEY_CrudePrev': '% of Adults who were Diagnosed with Kidney Disease',
    'LPA_CrudePrev': '% of Adults who are Physically Inactive',
    'MAMMOUSE_CrudePrev': '% Women aged 50-74 years who had a Mammogram in the Past 2 Years',
    'MHLTH_CrudePrev': '% of Adults who reported their Mental Health as not Good',
    'OBESITY_CrudePrev': '% of Adults who were Obese',
    'PHLTH_CrudePrev': '% of Adults who reported their Physical Health as not Good',
    'SLEEP_CrudePrev': '% of Adults who reported their Sleep as not Good',
    'STROKE_CrudePrev': '% of Adults who were Diagnosed with Stroke',
    'TEETHLOST_CrudePrev': '% of Adults who have lost all of their Natural Teeth'})

# list of health metrics for drop down menu
column_names = df.columns

# Creating the initial scatter plot
fig = go.Figure(go.Scatter(x=df[column_names[0]], y=df[column_names[1]], mode='markers'))

# Label axes
fig.update_xaxes(title_text='X Axis')
fig.update_yaxes(title_text='Y Axis')

# Setting the range for x and y axes
fig.update_xaxes(range=[0, 100])
fig.update_yaxes(range=[0, 100])

# add one hidden trace per pair of metrics
for col in column_names:
    for col2 in column_names:
        x = df[col]
        y = df[col2]
        fig.add_trace(go.Scatter(x=x, y=y, mode='markers',
                                 name=col + ' vs ' + col2,
                                 showlegend=False, visible=False))

# Update the visibility of the traces
def update_visibility(selected_col, selected_col2):
    for i, trace in enumerate(fig.data):
        if trace.name == selected_col + ' vs ' + selected_col2:
            trace.visible = True
        elif trace.name == selected_col + ' vs ' + selected_col2 + ' Best Fit':
            trace.visible = True
        else:
            trace.visible = False

# Create the drop-down menus for x (col) and y (col2) axes of the scatter plot
col_dropdown = [{'label': col, 'value': col} for col in column_names]
col2_dropdown = [{'label': col2, 'value': col2} for col2 in column_names]

# Define the dropdown menu for x-axis
button_layer_1_height = 1.08
x_axis_dropdown = go.layout.Updatemenu(
    buttons=list([dict(args=[{'x': [df[col]]}, update_visibility(col, col2)],
                       label=col,
                       method='update') for col in column_names]),
    direction="down",
    pad={"r": 10, "t": 10},
    showactive=True,
    x=0.06, xanchor="left",
    y=button_layer_1_height + 0.05, yanchor="top")

# Define the dropdown menu for y-axis
y_axis_dropdown = go.layout.Updatemenu(
    buttons=list([dict(args=[{'y': [df[col2]]}, update_visibility(col, col2)],
                       label=col2,
                       method='update') for col2 in column_names]),
    direction="down",
    pad={"r": 10, "t": 10},
    showactive=True,
    x=0.06, xanchor="left",
    y=button_layer_1_height, yanchor="top")

# Update the layout to include the dropdown menus
fig.update_layout(updatemenus=[x_axis_dropdown, y_axis_dropdown])

# Label axes
fig.update_xaxes(title_text='X Axis')
fig.update_yaxes(title_text='Y Axis')

# Setting the range for x and y axes
fig.update_xaxes(range=[0, 100])
fig.update_yaxes(range=[0, 100])

# Update plot sizing
fig.update_layout(
    width=900,
    height=900,
    autosize=False,
    # margin=dict(t=100, b=0, l=0, r=0),
)

# add annotations
fig.update_layout(
    annotations=[
        dict(text="X Axis:", x=0, xref="paper", y=button_layer_1_height + 0.025,
             yref="paper", align="left", showarrow=False),
        dict(text="Y Axis:", x=0, xref="paper", y=button_layer_1_height - 0.025,
             yref="paper", align="left", showarrow=False)
    ])

# Change background color to grey
fig.update_layout(plot_bgcolor='rgb(230, 230, 230)')

# Change scatter point color to red
fig.update_traces(marker=dict(color='red'))

# Change font to Proxima Nova
fig.update_layout(font=dict(family='Proxima Nova'))

# Display the scatter plot with dropdown menus
fig.show()
2.3 Data Visualization
Building on this, we created a second graph exploring the relationship between walkability and health outcomes. We created two drop-down menus using dcc.Dropdown from Dash, Plotly's web-app framework. The selected values from the drop-down menus are passed to the update_visibility function, which updates the scatter plot based on the selected health metrics. We then displayed the final plot using dcc.Graph.
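The code for this second graph is not reproduced here, so the following is a minimal sketch of the Dash pattern just described; the file name walkability_health.csv and the column names are illustrative placeholders, not the ones used in our notebook.

from dash import Dash, dcc, html, Input, Output
import plotly.express as px
import pandas as pd

# hypothetical cleaned file with one row per census tract
df = pd.read_csv('walkability_health.csv')
metrics = [c for c in df.columns if c != 'walkability_score']  # health outcome columns

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id='metric', options=metrics, value=metrics[0]),
    dcc.Graph(id='scatter'),
])

@app.callback(Output('scatter', 'figure'), Input('metric', 'value'))
def update_visibility(metric):
    # redraw the scatter of walkability vs. the selected health metric
    return px.scatter(df, x='walkability_score', y=metric)

if __name__ == '__main__':
    app.run(debug=True)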
3. How accessible are neighborhoods in Washington, D.C. by bike?
3.1 Data Collection
Bike lanes and the Capital Bikeshare program are two major initiatives by the D.C. Department of Transportation for improving quality of life and transportation access in D.C. Since these two initiatives go hand in hand, we decided to analyze them together in a single visualization. For this analysis we collected the following datasets:
Capital Bikeshare Trip Data for March 2023: This dataset compiles all of the Capital Bikeshare trips completed by riders in the Washington, D.C. metro area. It contains fields such as started_at and ended_at (each trip's start and end times) and start_station_name and end_station_name (street name information for the starting and ending dock stations). It also includes the latitude and longitude of the start and end locations of each rider's trip.
U.S. Census Tract D.C. GeoJSON: This GeoJSON map file, created by the U.S. Census Bureau for Washington, D.C., includes geographical information regarding the location of each neighborhood in Washington, D.C.
Open Data DC Bike Lane GeoJSON: This GeoJSON file, published by the D.C. Government, provides geographical information regarding the various bike lanes located throughout the city.
3.2 Data Cleaning
Since the goal of this visualization was to show both the bike lane and bikeshare data on a single map, we cleaned the datasets to put them in the proper format. In the Capital Bikeshare dataset, each trip's latitude and longitude were recorded from the point where the bike was activated at the dock. This meant that even when rides started at the same bike dock, there were small variations in the recorded latitudes and longitudes. Therefore, the latitude and longitude values were standardized for each bike station. Additionally, the bikeshare dataset covered the entire Washington, D.C. metro area; since we focused our analysis on D.C. only, stations outside the D.C. boundary were removed. Finally, any rows with incomplete trip values were removed.
The GeoJSON files did not undergo any processing and therefore were used as-is.
Code
import pandas as pd
import numpy as np
import altair as alt
import plotly.graph_objects as go
from vega_datasets import data
import requests
import json
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')

# Read in data
data_dir = Path().absolute().parent.absolute().parent/"data"
img_dir = Path().absolute().parent.absolute().parent/"data"/"img"
bikeshare_df = pd.read_csv(data_dir/"202303-capitalbikeshare-tripdata.csv")

# Convert dates into datetime format
bikeshare_df['started_at'] = pd.to_datetime(bikeshare_df['started_at'])
bikeshare_df['ended_at'] = pd.to_datetime(bikeshare_df['ended_at'])

# Drop rides with NaN values
bikeshare_df.dropna(subset=['start_station_name'], inplace=True)
bikeshare_df.dropna(subset=['end_station_name'], inplace=True)

# Standardize longitude and latitude using start station
bikeshare_df['start_lng'] = bikeshare_df['start_lng'].groupby(bikeshare_df['start_station_id']).transform('max')
bikeshare_df['start_lat'] = bikeshare_df['start_lat'].groupby(bikeshare_df['start_station_id']).transform('max')

# Create dataframe for joining
tmp = bikeshare_df[['start_station_id', 'start_lng', 'start_lat']]
tmp.drop_duplicates(inplace=True)

# Merge using the common station id value
bikeshare_df = bikeshare_df.merge(tmp, left_on='end_station_id', right_on='start_station_id')

# Drop repeated columns and rename them
bikeshare_df.drop(columns=['end_lat', 'end_lng', 'start_station_id_y'], inplace=True)
bikeshare_df.rename(columns={'start_lat_x': 'start_lat', 'start_lng_x': 'start_lng',
                             'start_lat_y': 'end_lat', 'start_lng_y': 'end_lng',
                             'start_station_id_x': 'start_station_id'}, inplace=True)

bikeshare_df.to_csv(Path().absolute().parent.absolute().parent/"data/cleaned_data/bikeshare_cleaned.csv", index=False)
3.3 Data Visualization
In the Altair plot, we combined March 2023 Capital Bikeshare data with D.C. geographical data to create a layered visual. The geographical layout is a map of D.C. with each neighborhood outlined in white. Each point represents a Capital Bikeshare station, with its size representing the number of trips starting from that station in the month of March. The yellow lines represent the bike lanes created by the D.C. government on public streets. When the user hovers over any station, the visual shows black network lines (or edges) that connect that station (the "start" station) to other stations ("end" stations), representing where people have traveled on Capital Bikeshare bikes. We also included tooltips that display station information when a user hovers over a station.
Code
import pandas as pd
import numpy as np
import altair as alt
import plotly.graph_objects as go
from vega_datasets import data
import requests
import json
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')

data_dir = Path().absolute().parent.absolute().parent/"data"
bikeshare_df = pd.read_csv(Path().absolute().parent.absolute().parent/"data/cleaned_data/bikeshare_cleaned.csv")

# Create list of bikeshare stations outside of DC
nondc_stations = [
    32256,32251,32237,32241,32210,32225,32259,32223,32209,32240,32239,32245,32220,32214,32219,
    32224,32217,32213,32239,32246,32247,32250,32248,32246,32228,32215,32238,32252,32249,32260,
    32234,32231,32235,32255,32200,32208,32201,32211,32227,32207,32229,32221,32206,32233,32205,
    32204,32205,32203,32206,32222,32230,32232,32600,32602,32603,32608,32605,32604,32607,32609,
    31948,31904,32606,32601,31921,31905,31902,31901,31976,31036,31977,31900,31920,31049,31037,
    31926,31919,31035,31973,31069,31023,31022,31021,31019,31020,31094,31092,31079,31030,31029,
    31080,31093,31014,31062,31077,31073,31024,31040,31028,31017,31924,31027,31947,31066,31075,
    31949,31053,31971,31067,31058,31923,31063,31068,31951,31945,31095,31006,31005,31091,31004,
    31936,31071,31090,31950,31064,31935,31011,31012,31009,31944,31052,31010,31959,31916,31088,
    31960,31956,31910,31083,31915,31087,31085,31913,31915,31970,31969,31906,31098,31048,31081,
    31084,31082,31974,31930,31932,31953,31942,31967,32406,32423,32415,32407,32405,32401,32400,
    32405,32404,32413,32418,32410,32403,32408,32421,32402,32417,32422,32420,32414,32412,32416,
    32059,32061,32026,32011,32049,32082,32058,32025,32001,32058,32082,32024,32043,32036,32012,
    32034,32035,32050,32056,32426,32425,32424,32426,32085,32094,32089,32093,32091,32090,32087,
    32088,32086,32092,32022,32066,32064,32062,32065,32073,32063,32084,32054,32051,32040,32046,
    32029,32055,32002,32021,32003,32048,32013,32000,32008,32028,32027,32053,32039,32057,32078,
    32075,32077,32076,32079,32080,32074,32081,32032,32047,32044,32017,32007,32009,32023,32033,
    32016,32004,32005,32072,32041,32052,32071,32038,32037,32045,32067,32069,32068,32018,32253,
    32236,32243,32258,32216,32212,32218,32019,32411,31929,31914,31907,31903,31958,31933,31041,
    31042,31968,31044,31045,31955,31046,31047,31099,31043,31097,31931,31918,31086,31927,31966,
    21943,31963,31952,31964,31962,31908,31072,31941,31961,31928,31054,31033,31059,31057,31061,
    31056,31055,31909,31912,31065,31032,31074,31078,32419,31957,31954,31946,31972,31060,31938,
    31013,31002,31007,31000,31003,31096,31070,31039,31034,31025,31038,31026,31050,31940,31089,
    31031,31051,31937,31016,31018,31039,31015,31917,31076,31939,32409]

alt.data_transformers.enable('default', max_rows=None)

#### BACKGROUND FOR DC MAP
# Define background of Washington D.C.
response1 = requests.get('https://raw.githubusercontent.com/arcee123/GIS_GEOJSON_CENSUS_TRACTS/master/11.geojson')
background = alt.Chart(
    alt.Data(values=response1.json()),
    title="Map of D.C. Bike Lanes, Capital Bikeshare Stations, & Routes in March 2023"
).mark_geoshape(
    fill="lightgray", stroke='white', strokeWidth=1
).properties(width=600, height=600)

#### BACKGROUND FOR DC BIKE LANE LOCATIONS
# Open GeoJSON file for bicycle lanes
with open(data_dir/'Bicycle_Lanes.geojson') as f:
    data = json.load(f)

# Create background of D.C.
background_lanes = alt.Chart(alt.Data(values=data)).mark_geoshape(
    stroke='#d6a320', strokeWidth=1
).properties(width=600, height=600)

#### MOUSEOVER SELECTION
# Create mouseover selection
select_station = alt.selection_single(
    on="mouseover", nearest=True, fields=["start_station_name"], empty='none')

#### NETWORK CONNECTIONS FOR MAP
# Filter non-DC stations
tmp1 = bikeshare_df[~bikeshare_df['start_station_id'].isin(nondc_stations)]
tmp1 = tmp1[~tmp1['end_station_id'].isin(nondc_stations)]

# Keep only relevant columns and drop duplicates to have one row per route
tmp1 = tmp1[['start_station_name', 'start_station_id', 'end_station_name', 'end_station_id',
             'start_lat', 'start_lng', 'end_lat', 'end_lng']].drop_duplicates()

# Define connections
connections = alt.Chart(tmp1).mark_rule(opacity=0.35).encode(
    latitude="start_lat:Q",
    longitude="start_lng:Q",
    latitude2="end_lat:Q",
    longitude2="end_lng:Q"
).transform_filter(select_station)

#### POINTS FOR MAP
# Filter non-DC stations
tmp2 = bikeshare_df[~bikeshare_df['start_station_id'].isin(nondc_stations)]
tmp2 = tmp2[~tmp2['end_station_id'].isin(nondc_stations)]

# Create boolean columns for rideable type and member type
tmp2['classic_bike'] = np.where(tmp2['rideable_type'] == 'classic_bike', 1, 0)
tmp2['electric_bike'] = np.where(tmp2['rideable_type'] == 'electric_bike', 1, 0)
tmp2['docked_bike'] = np.where(tmp2['rideable_type'] == 'docked_bike', 1, 0)

# Temporary dataframe showing unique station locations with ride count
tmp2 = tmp2[['start_station_name', 'start_station_id', 'start_lng', 'start_lat', 'ride_id',
             'classic_bike', 'electric_bike', 'docked_bike']].groupby(
    ['start_station_name', 'start_station_id', 'start_lng', 'start_lat']).agg(
    {'ride_id': 'count', 'classic_bike': 'sum', 'electric_bike': 'sum', 'docked_bike': 'sum'}).reset_index()
tmp2.rename(columns={'ride_id': 'count_rides', 'classic_bike': 'count_classic',
                     'electric_bike': 'count_electric', 'docked_bike': 'count_dock'}, inplace=True)
tmp2['color'] = 'Bike Station'

points = alt.Chart(tmp2).mark_circle().encode(
    latitude="start_lat:Q",
    longitude="start_lng:Q",
    color=alt.Color('color:N', title="Legend",
                    scale=alt.Scale(domain=['Bike Station', 'Bike Lane'],
                                    range=['#962e2ec8', '#d6a320'])),
    size=alt.Size("count_rides:Q", scale=alt.Scale(range=[15, 250]), legend=None),
    order=alt.Order("count_rides:Q", sort="descending"),
    tooltip=[
        alt.Tooltip('start_station_id:Q', title='\U0001f4cd Start Station ID'),
        alt.Tooltip('start_station_name:N', title='\U0001f6e3 Start Station Name'),
        alt.Tooltip('count_rides:Q', title='\U0001f50d Ride Count'),
        alt.Tooltip('count_classic:Q', title='\U0001f6b4 Classic Bike Count'),
        alt.Tooltip('count_electric:Q', title='\U0001f6b5 Electric Bike Count'),
        alt.Tooltip('count_dock:Q', title='\U0001f6b2 Docked Bike Count')
    ]).add_selection(select_station)

alt.themes.enable('vox')  # add theme

# Show visualization
(background + background_lanes + connections + points).configure_view(stroke=None)
4. What is metro ridership in Washington, D.C. like?
4.1 Data Collection
To obtain information about metro ridership, we turned to WMATA, which has its own ridership data portal. The data comes from WMATA's Metrorail Ridership Year-over-Year Change dashboard. Only data for March 2023 is available, so we used the download button in the bottom right corner of the platform to download this data. For more information about the data collection method, contact planning_ridership@wmata.com.
4.2 Data Cleaning
To answer this question, we built an interactive time series plot of Washington Metropolitan Area Transit Authority (WMATA) Metro entries by selected station and date in March 2023 using the Plotly library. Before plotting, we cleaned the data by renaming the 'Date_This_Year' and 'Entries_This_Year' columns to 'Date' and 'Entries', respectively, for clarity. We also pivoted the dataframe so that the number of entries is shown by station in columns and by date in rows. Then, we converted the index column, which contains the dates, to datetime format and sorted the index from earliest to latest date. Finally, we selected the stations Anacostia, Stadium-Armory, Van Ness-UDC, Shaw-Howard Univ, Gallery Place, and Capitol South from the dataframe: the first three stations serve three of the least walkable neighborhoods in Washington, D.C., and the last three serve three of the most walkable. Had we included all stations' data, the plot would have been too crowded and confusing. We saved the cleaned data to a CSV file.
Code
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

data_dir = Path().absolute().parent.absolute().parent/"data"
write_data_dir = data_dir/"cleaned_data"

# Read in data
df = pd.read_csv(data_dir/'wmata.csv', encoding='utf-16', delimiter="\t")

# Remove unneeded columns
df = df.drop(['Day_of_Date_This_Year', 'Servicetype_This_Year_(group)', 'Holiday_Last_Year',
              'Holiday_This_Year', 'Servicetype_This_Year', 'Time_Period', 'Date_Last_Year',
              'Entries_Last_Year'], axis=1)

# Rename columns
df = df.rename(columns={'Date_This_Year': 'Date'})
df = df.rename(columns={'Entries_This_Year': 'Entries'})

# Pivot data
pivot_df = df.pivot_table(index='Date', columns='Station', values='Entries')

# Convert index of pivot_df to datetime
pivot_df.index = pd.to_datetime(pivot_df.index)

# Organize index of pivot_df from earliest to latest date
pivot_df = pivot_df.sort_index()

# Select stations Anacostia, Stadium-Armory, Van Ness-UDC, Shaw-Howard U, Gallery Place, and Capitol South
new_df = pivot_df[['Anacostia', 'Stadium-Armory', 'Van Ness-UDC', 'Shaw-Howard U', 'Gallery Place', 'Capitol South']]

# Save pivot_df, new_df, and the long-format df to csv
pivot_df.to_csv(write_data_dir/'wmata_cleaned.csv', index=False)
new_df.to_csv(write_data_dir/'wmata_new_cleaned.csv', index=False)
df.to_csv(write_data_dir/'wmata_long_cleaned.csv', index=False)
4.3 Data Visualization
Using the cleaned data, we built an interactive time series plot using the plotly.graph_objects and plotly.subplots libraries. We read in the cleaned data (wmata_new_cleaned.csv) and created a list of all the unique station names in the dataset. We created a subplot with one trace per station, using a for loop to iterate through each station and add its trace to the subplot. Then, we created a dropdown menu that lets users choose a station from the list of available stations, with each station represented by a button. When a user selects a station's button, the update method toggles trace visibility so that only the entries by day at the selected station are shown. The layout of the plot is updated to include the dropdown menu and to set the title, axis labels, and fixed axis ranges. The fig.show() method displays the plot.
Code
# Build interactive timeseries plot using plotly
# Import libraries
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib as mpl

mpl.rc_params_from_file('style_sheet.mplstyle')  # load custom style sheet

# List all stations
stations = list(new_df.columns)

# Create subplot with one trace per station
fig = make_subplots(rows=1, cols=1)
for station in stations:
    fig.add_trace(
        go.Scatter(x=new_df.index, y=new_df[station], name=station),
        row=1, col=1
    )

# Create dropdown menu to select station
buttons = []
for station in stations:
    buttons.append(dict(method='update',
                        label=station,
                        args=[{'visible': [station == s for s in stations]}]))

dropdown = dict(
    active=0,
    buttons=buttons,
    direction='down',
    showactive=True,
    x=1.1,
    y=1.1)

# Update layout
fig.update_layout(
    updatemenus=[dropdown],
    height=600,
    width=900,
    title='WMATA Metro Entries by Selected Station and Date in March 2023',
    xaxis_title='Date',
    yaxis_title='Entries',
    yaxis=dict(range=[0, 3000]))

# Show plot
fig.show()

#### Table
# Import the long-format data from the file
df = pd.read_csv(write_data_dir/'wmata_long_cleaned.csv')
df.Date = pd.to_datetime(df.Date)
weekly_df = df.groupby([pd.Grouper(key='Date', freq='W'), 'Station'])['Entries'].sum().reset_index()

# sort so that the station with the most ridership per week is on top
weekly_df = weekly_df.sort_values(by=['Date', 'Entries'], ascending=[True, False])

# format the dates so they look cleaner in the table
weekly_df.Date = weekly_df.Date.dt.strftime('%B %-d')

# format number of rides
weekly_df['Entries'] = weekly_df['Entries'].apply(lambda x: "{:,}".format(int(x)))

headerColor = '#962E2E'   # red part of theme
rowEvenColor = '#FDE6AB'  # lighter yellow part of theme
rowEvenColor2 = 'lightgray'
rowOddColor = 'white'
num_stations = weekly_df.Station.nunique()
num_days = weekly_df.Date.nunique()

table_fig = go.Figure(data=[go.Table(
    header=dict(
        values=['Start of Week', 'Metro Station', 'Num. of Riders over 7 Days'],
        line_color='darkslategray',
        fill_color=headerColor,
        align=['center'],
        font=dict(color='white', size=12)
    ),
    cells=dict(
        values=[weekly_df.Date, weekly_df.Station, weekly_df.Entries],
        line_color='darkslategray',
        # 2-D list of colors for alternating rows
        fill_color=[[rowOddColor, rowEvenColor]*(int(len(df)/2)+len(df) % 2)],
        # fill_color = [([rowOddColor]*num_stations+[rowEvenColor]*num_stations)*(int(num_days/2)+(num_days%2))],
        # fill_color = [([rowOddColor,rowEvenColor]*(int(num_stations/2)+(num_stations%2))+[rowOddColor,rowEvenColor2]*(int(num_stations/2)+(num_stations%2)))*(int(len(df)/(num_stations*2)) + len(df)%(num_stations*2))],
        align=['center'],
        font=dict(color='darkslategray', size=11)
    ))])
table_fig.update_layout(title_text='Weekly D.C. Metro Ridership')
table_fig.update_layout({'margin': {'t': 50}})
table_fig.show()
5. What is public sentiment around walkability in Washington, D.C.?
5.1 Data Collection
To collect public sentiment regarding different transportation methods, we used the Reddit API to scrape replies to posts found using keywords such as "Walkability", "Cars", and "Bicycles". The API collects the comments from each thread URL we scrape, and a dataframe was finally generated with the content of each post.
Code
import praw
import pandas as pd

# Creating connection to API
reddit_read_only = praw.Reddit(client_id="",
                               client_secret="",
                               user_agent="Walkability_DC")

subreddit = reddit_read_only.subreddit("redditdev")
print("Name:", subreddit.display_name)
print("Title:", subreddit.title)
print("Description:", subreddit.description)

# Gather urls of predefined topics
# (bikes_urls, cars_urls, and walk_urls are lists of thread URLs defined elsewhere)
topics = {"bikes": [bikes_urls],
          "cars": [cars_urls],
          "walk": [walk_urls]}

# Scraping data by url
for key in topics.keys():
    comments = []
    for url in topics[key][0]:
        try:
            submission = reddit_read_only.submission(url=url)
        except:
            print("incorrect url:", url)
            continue
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            comments.append(comment.body)
    topics[key].append(comments)

# Generation of dataframe with all threads per category
df_reddit = pd.DataFrame(data=[topics["bikes"][1], topics["cars"][1], topics["walk"][1]]).T
df_reddit = df_reddit.rename(columns={0: "bikes", 1: "cars", 2: "walk"})
df_reddit.to_csv("../data/reddit.csv", index=False)
5.2 Data Cleaning
A natural language analysis was then performed, starting by removing non-letter characters and lowercasing the text.
Code
import pandas as pd
import re
import random
import numpy as np
import torch as torch
import tensorflow as tf
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from pysentimiento import create_analyzer
import json

# opening previously obtained data
df = pd.read_csv('../data/reddit.csv')
df = df.fillna("")
df.head()

# function to clean strings
def remove_nonletters(strings):
    pattern = re.compile(r"[^a-zA-Z\s]|[\n]+")
    cleaned_strings = [(pattern.sub("", i)).lower() for i in strings]
    return cleaned_strings

for col in df.columns:
    col_clean = str(col) + "_clean"
    df[col_clean] = remove_nonletters(df[col])
After cleaning the characters, we removed stopwords and tokenized the texts to generate word clouds per category.
Code
# Cleaning stopwords
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import re

s_words = list(stopwords.words('english'))
extra_s_words = ["the", "to", "a", "and", "of", "i", "is", "you", "it", "i", "is", "ive",
                 "youve", "im", "dc", "youre", "dont", "would", "also", "might"]
for i in extra_s_words:
    s_words.append(i)

pattern = re.compile(r'\s*\b(' + '|'.join(s_words) + r')\b\s*', re.IGNORECASE)

x = ["bikes_clean", "cars_clean", "walk_clean"]
words = {}
for col in x:
    tokens = [word_tokenize(pattern.sub(' ', i)) for i in df[col]]
    words[col] = " ".join([word for sentence in tokens for word in sentence])
words

# Obtaining the frequency of each word
fdists = {}
for key in words.keys():
    tokenized_words = nltk.tokenize.word_tokenize(words[key])
    fdist = FreqDist(tokenized_words)
    fdists[key] = fdist
fdists

# obtaining the top frequent values and generating the string to be consumed by the wordcloud script
words_freq = []
for key_n in fdists.keys():
    fdist = (dict(fdists[key_n].most_common(30)))
    result = ""
    total = sum(fdist.values())
    for key, value in fdist.items():
        n = (value * 100) // total
        result += (key + " ") * n
    words_freq.append(result)
words_freq
Furthermore, to obtain the sentiment of each opinion, we used a pre-trained BERT model (via pysentimiento) to predict negative, neutral, and positive sentiment.
Code
# Calling a BERT transformer
# (the scraped comments are in English, so we use the English sentiment model)
analyzer = create_analyzer(task="sentiment", lang="en")

# predicting sentiment using analyzer
for i in ["walk", "bikes", "cars"]:
    file_name = "../data/" + i + ".txt"
    df = pd.read_csv(file_name)
    df = df.dropna()
    col = i + "_clean"
    sentiment = [analyzer.predict(text) for text in df[col]]
    sent_label = [s.output for s in sentiment]
    sent_prob = [s.probas for s in sentiment]
    df_probs = pd.DataFrame(sent_prob)
    df_probs["label"] = sent_label
    df_probs[i] = df[col].to_list()  # align cleaned text with its predicted sentiment
    file_name = "../data/" + i + ".csv"
    df_probs.to_csv(file_name)

# opening previously modified sentiment analysis
b_sent = pd.read_csv("../data/bikes_sent.csv")
c_sent = pd.read_csv("../data/cars_sent.csv")
w_sent = pd.read_csv("../data/walk_sent.csv")

# drop index
b_sent = b_sent.drop(columns=['Unnamed: 0'])
c_sent = c_sent.drop(columns=['Unnamed: 0'])
w_sent = w_sent.drop(columns=['Unnamed: 0'])

# add category
b_sent["category"] = "bike"
c_sent["category"] = "car"
w_sent["category"] = "walk"

b_sent = b_sent.rename(columns={"bikes": "text"})
c_sent = c_sent.rename(columns={"cars": "text"})
w_sent = w_sent.rename(columns={"walk": "text"})

# generate a single dataframe
df_sent = pd.concat([b_sent, c_sent, w_sent], ignore_index=True)
df_sent.shape
Finally, to better understand the sentiment, we analyzed the bigrams of words in each category to understand the words behind the opinions.
Code
import nltk
from functools import reduce

# Obtaining bigrams and generating a dictionary with their frequency
bigram_freqs = []
j = 0
categories = [words[cat] for cat in words.keys()]
flattened_text = " ".join([text for text in [words['bikes_clean'], words['cars_clean'], words['walk_clean']]])
categories.append(flattened_text)

for cat in categories:
    # obtaining bigrams per category
    x = pattern.sub(' ', cat)
    x = x.replace("  ", " ").replace("  ", " ")
    tokens = nltk.word_tokenize(x)
    bigrams = list(nltk.bigrams(tokens))
    y = {i: 0 for i in set(bigrams)}
    for bigram in bigrams:
        y[bigram] += 1
    bigram_freqs.append(dict(y))

"""
To observe the bigrams on a network diagram we want to store the values and their
frequency in a JSON file with the following syntax:
data = {"nodes": [{"name": "A Chifolleau", "n": 1, "grp": 1, "id": "A Chifolleau"}, ...],
        "links": [{"source": "A Besnard", "target": "M Ardisson", "value": 1},
                  {"source": "A Besnard", "target": "Y Holtz", "value": 1}, ...]}
"""
def transform_dict(input_dict):
    unique_words = set()
    for key in input_dict.keys():
        unique_words.add(key[0])
        unique_words.add(key[1])
    word_to_id = {word: i + 1 for i, word in enumerate(sorted(unique_words))}
    # nodes = [{'name': word, 'value': word_to_id[word], 'colour': '#bde0f6'} for word in sorted(unique_words)]
    nodes = [{'name': word, 'n': word_to_id[word], "grp": 1, "id": word} for word in sorted(unique_words)]
    links = []
    for key, value in input_dict.items():
        source = word_to_id[key[0]]
        target = word_to_id[key[1]]
        links.append({'source': key[0], 'target': key[1], 'value': value})
    return {'nodes': nodes, 'links': links}

# Function to reduce the dictionary to contain only values with 3 connections or more
def filter_dict(d):
    return {k: v for k, v in d.items() if v >= 3}

filtered_bigrams = [filter_dict(i) for i in bigram_freqs]
networks = [transform_dict(i) for i in filtered_bigrams]

# storing all dictionaries in json files
cat = ["bikes", "cars", "walk", "all"]
for i in range(0, len(cat)):
    filename = "../data/bigrams_" + cat[i] + ".json"
    with open(filename, 'w') as fp:
        json.dump(networks[i], fp, indent=4)
5.3 Data Visualization
We generated visualizations using the Plotly library in Python to analyze the sentiment polarity of comments related to the different categories (bike, walk, and car). We first load and clean the data from three CSV files (b_sent.csv, c_sent.csv, and w_sent.csv), which contain the sentiment analysis results for Reddit comments related to bikes, cars, and walking, respectively. Next, we create three box plots using Plotly, one for each category of transport, with the sentiment polarity values on the y-axis and the categories on the x-axis. After the first graph, we create a 2D histogram contour plot and two more histograms. The contour plot (trace2) shows the density of sentiment polarity values for each category, with darker areas indicating higher density. The two histograms (trace3 and trace4) show the distribution of sentiment polarity values along the x-axis and of categories along the y-axis. The final visualizations provide insight into the sentiment polarity of comments in each category, allowing a comparison of sentiment distribution across categories.
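The box-plot portion of that figure is not reproduced below, so the following is a minimal sketch of the pattern; it assumes the combined df_sent dataframe from Section 5.2 and a hypothetical numeric polarity column (for instance, P(positive) minus P(negative) from the model's probabilities).

import plotly.graph_objects as go

fig = go.Figure()
for cat in ['bike', 'car', 'walk']:
    subset = df_sent[df_sent['category'] == cat]
    fig.add_trace(go.Box(y=subset['polarity'], name=cat))  # one box per transport category

fig.update_layout(xaxis_title='Category', yaxis_title='Sentiment polarity')
fig.show()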
Code
d3 = require("d3@7")
d3Cloud = require("d3-cloud@1")
import {howto} from "@d3/example-components"

function WordCloud(title, text, {
  size = group => group.length, // Given a grouping of words, returns the size factor for that word
  word = d => d,                // Given an item of the data array, returns the word
  marginTop = 0,     // top margin, in pixels
  marginRight = 0,   // right margin, in pixels
  marginBottom = 0,  // bottom margin, in pixels
  marginLeft = 0,    // left margin, in pixels
  width = 640,       // outer width, in pixels
  height = 400,      // outer height, in pixels
  maxWords = 250,    // maximum number of words to extract from the text
  fontFamily = "sans-serif", // font family
  fontScale = 30,    // base font size
  padding = 0,       // amount of padding between the words (in pixels)
  rotate = 0,        // a constant or function to rotate the words
  invalidation       // when this promise resolves, stop the simulation
} = {}) {
  const words = typeof text === "string" ? text.split(/\W+/g) : Array.from(text);

  const data = d3.rollups(words, size, w => w)
    .sort(([, a], [, b]) => d3.descending(a, b))
    .slice(0, maxWords)
    .map(([key, size]) => ({text: word(key), size}));

  const svg = d3.create("svg")
      .attr("viewBox", [0, 0, width, height])
      .attr("width", width)
      .attr("font-family", fontFamily)
      .attr("text-anchor", "middle")
      .attr("fill", "#962e2ec8")
      .attr("style", "max-width: 100%; height: auto; height: intrinsic;")
      .text(title);

  const g = svg.append("g").attr("transform", `translate(${marginLeft},${marginTop})`);

  const cloud = d3Cloud()
      .size([width - marginLeft - marginRight, height - marginTop - marginBottom])
      .words(data)
      .padding(padding)
      .rotate(rotate)
      .font(fontFamily)
      .fontSize(d => Math.sqrt(d.size) * fontScale)
      .on("word", ({size, x, y, rotate, text}) => {
        g.append("text")
            .attr("font-size", size)
            .attr("transform", `translate(${x},${y}) rotate(${rotate})`)
            .text(text);
      });

  cloud.start();
  invalidation && invalidation.then(() => cloud.stop());
  return svg.node();
}

WordCloud("Bicycle", "bike bike bike bike bike bike bike bike bike bike bike bike bike get get get get get get one one one one lanes lanes lanes lanes good good good good trail trail trail trail st st st lane lane lane park park park like like like street street street city city mbt mbt bikes bikes many many ride ride trails trails way way pretty pretty people people th th go go rock rock creek creek nw nw map map take take lock lock want want much much", {
  width: 250,
  height: 100,
  size: () => .3 + Math.random(),
  rotate: () => (~~(Math.random() * 6) - 3) * 30
})