The 2019 Women’s FIFA World Cup in France saw unprecedented interest in the tournament, with viewership, attendance, and digital engagement reaching record heights across the globe. Four years later, the Women’s FIFA World Cup is set to happen this July in New Zealand and Australia. Despite progress being made, significant inequality persists between men’s and women’s football. Through our data gathering process, we recognized how challenging it is to find concise information and digestible visualizations on women’s football compared with men’s football. Therefore, we intend to highlight this disparity through our first set of visualizations. These visualizations will cover a range of topics including popular phrases over time, overall spending differences between World Cups historically, and more.
As we approach the Women’s World Cup, we are especially motivated to gain a deeper understanding of the characteristics of women’s football. We hope to take you on a visual journey into the past and present of women’s football with our second set of visualizations. This set will delve into winning trends over time, game statistics on the field, player attributes, and more, relating solely to women’s FIFA. By examining these visualizations, we aim to gain important insights into what is in store for the 2023 Women’s World Cup.
The FIFA World Cup: A Uniting Global Event
The FIFA World Cup is typically regarded as one of the largest sporting events in the world, with extensive media attention. What is the public discussing regarding FIFA and the World Cup?
Popular Phrases: An Insight into the FIFA World Cup
In the past couple of years, Twitter has become an international communication hub for topics ranging from international politics to the latest sports news. Twitter’s rise has changed how official news sources interact with the public, with accounts such as ESPN able to update fans on the latest scores immediately. Because of this, we chose it as a source of text data to grasp precisely what the FIFA World Cup is. Tweets were pulled from three official accounts: FIFAWorldCup, FIFAWWC (FIFA Women’s World Cup), and FIFAcom. Through the Twitter API, tweets were pulled from as close to the most recent FIFA event as possible. This was done to see what keywords and phrases these official accounts were putting out to their fans. Through this data collection, we hoped to answer the overarching question ‘What exactly is FIFA?’
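To illustrate how key words per day could be tallied from tweet text, here is a minimal sketch using only the standard library. It is not the vectorization pipeline used for this article; the function name `daily_word_counts` and the sample tweets are made up for illustration.

```python
import re
from collections import Counter, defaultdict


def daily_word_counts(tweets):
    """Count lowercase words per date.

    tweets: iterable of (date_string, text) pairs (hypothetical input shape).
    """
    counts = defaultdict(Counter)
    for date, text in tweets:
        words = re.findall(r"[a-z']+", text.lower())
        counts[date].update(words)
    return counts


# Made-up sample tweets, for illustration only.
sample = [
    ("2022-12-18", "Messi lifts the World Cup"),
    ("2022-12-18", "What a goal by Messi"),
    ("2022-12-17", "Third place medal match"),
]
counts = daily_word_counts(sample)
print(counts["2022-12-18"].most_common(1))  # [('messi', 2)]
```

In practice, common stop words such as "the" and "a" would likely be filtered out before plotting.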
Code
import pandas as pd
import json
from wordcloud import WordCloud
import plotly.graph_objs as go
import warnings

warnings.simplefilter('ignore')

# Function that creates one trace per word cloud.
# Modified from https://github.com/PrashantSaikia/Wordcloud-in-Plotly
# Changes: multiple word clouds can be built from frequencies, and the
# location and color of each word remain static.
def plotly_wordcloud(text, word_dict):
    wc = WordCloud(max_font_size=100).generate_from_frequencies(text)
    word_list = []
    freq_list = []
    for word_info in wc.layout_:
        word_list.append(word_info[0][0])
        freq_list.append(word_info[0][1])

    # get the positions
    x = []
    y = []
    color_list = []
    # look in the dictionary to get the locations and colors
    for i in word_list:
        position = word_dict[i]['position']
        color_list.append(word_dict[i]['color'])
        x.append(float(position[0]))
        y.append(float(position[1]))

    # get the relative occurrence frequencies
    new_freq_list = [freq * 100 for freq in freq_list]
    trace = go.Scatter(
        x=x,
        y=y,
        textfont=dict(size=new_freq_list, color=color_list),
        textposition='middle center',
        hoverinfo='text',
        hovertext=['{0}{1}'.format(w, f) for w, f in zip(word_list, freq_list)],
        mode='text',
        text=word_list,
        visible=False,
    )
    return trace

vec_df = pd.read_csv("../data/FIFAWorldCupvectorized.csv", index_col=0)
vec_FIFAcom = pd.read_csv("../data/FIFAcomvectorized.csv", index_col=0)
df_vector = pd.read_csv("../data/FIFAWCCvectorized.csv", index_col=0)

# Load the JSON files with the word positions and colors
with open('../data/dict_1.json') as f:
    dict_1 = json.load(f)
with open('../data/dict_2.json') as f:
    dict_2 = json.load(f)
with open('../data/dict_3.json') as f:
    dict_3 = json.load(f)

# need to generate a trace for each column
date_1 = list(vec_df.columns)
date_2 = list(vec_FIFAcom.columns)
date_3 = list(df_vector.columns)
dataframes = [vec_df, vec_FIFAcom, df_vector]
wdict_list = [dict_1, dict_2, dict_3]

# initialize the fig
fig = go.Figure()
for i, dataframe in enumerate(dataframes):
    # creating the traces for the current twitter account
    for k in range(len(dataframe.columns) - 1):
        text = dataframe.iloc[:, k]
        word_dict = wdict_list[i]
        trace = plotly_wordcloud(text, word_dict)
        fig.add_trace(trace)

fig.data[0].visible = True
fig.update_layout(template="plotly_white")
fig.update_layout(xaxis=dict(showline=False, zeroline=False),
                  yaxis=dict(showline=False, zeroline=False))

# one visibility mask per trace: only trace i is shown in mask i
matrix_false = []
for i in range(len(fig.data)):
    x = [False] * len(fig.data)
    x[i] = True
    matrix_false.append(x)

# creating the dictionaries
dic_sliders = {}
step_count = 0
account_dict = [('FIFAWorldCup', len(date_1) - 1),
                ('FIFAcom', len(date_2) - 1),
                ('FIFAWWC', len(date_3) - 1)]
for val, j in account_dict:
    steps = []
    for k in range(j):
        step = {'method': 'update', 'args': [{'visible': matrix_false[step_count]}]}
        steps.append(step)
        step_count += 1
    dic_sliders[val] = [dict(active=0, currentvalue={"prefix": "Date: "}, pad={"t": 50}, steps=steps)]

fig.update_layout(
    sliders=dic_sliders['FIFAWorldCup'],
    xaxis=dict(range=[-1, 7], tickvals=[], ticktext=[]),
    yaxis=dict(range=[-2, 8], tickvals=[], ticktext=[]),
)
fig.update_layout(
    updatemenus=[dict(
        buttons=list([
            dict(label="FIFAWorldCup", method="update",
                 args=[{"visible": matrix_false[0]}, {"sliders": dic_sliders['FIFAWorldCup']}]),
            dict(label='FIFAcom', method="update",
                 args=[{"visible": matrix_false[23]}, {"sliders": dic_sliders['FIFAcom']}]),
            dict(label='FIFAWWC', method="update",
                 args=[{"visible": matrix_false[57]}, {"sliders": dic_sliders['FIFAWWC']}]),
        ]),
        direction="down",
        showactive=True,
        pad={"r": 1, "t": 1},
        x=0, xanchor="left",
        y=2, yanchor="top",
    )],
    plot_bgcolor="#F1F0DA",
    paper_bgcolor="#F1F0DA",
    title="Key Words From Official Tweets During the 2022 World Cup",
    title_x=0.5,
    annotations=[dict(
        x=0, y=0,
        showarrow=False,
        text='Data Source: Tweets Pulled From @FIFAWorldCup, @FIFAWWC, @FIFAcom [5]',
        xref="paper", yref="paper",
        font=dict(size=10, color="black"),
    )],
    width=900,
    height=500,
)
fig.show()
An interactive word cloud was created using Plotly. The drop-down menu lets the user choose which account to focus on, while the slider shows that account’s main phrases on each date as the FIFA 2022 World Cup took place. The user will see many vital terms such as Qatar, medal, world, and goal, as well as other phrases showing that this event is truly an international experience. On specific days, the user can see which players are considered key, such as Messi and Mbappe.
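The drop-down and slider rely on one visibility mask per trace, plus the index of each account’s first trace. The sketch below shows that bookkeeping in plain Python, independent of Plotly. The helper names `visibility_masks` and `first_index_per_account` and the trace counts are hypothetical and chosen only for illustration.

```python
def visibility_masks(n_traces):
    """Return one boolean mask per trace; mask i shows only trace i."""
    return [[j == i for j in range(n_traces)] for i in range(n_traces)]


def first_index_per_account(trace_counts):
    """Map each account to the index of its first trace.

    trace_counts: ordered list of (account, number_of_traces) pairs.
    """
    starts = {}
    offset = 0
    for account, count in trace_counts:
        starts[account] = offset
        offset += count
    return starts


# Hypothetical trace counts (not the real data): 3, 2, and 4 dates per account.
counts = [("FIFAWorldCup", 3), ("FIFAcom", 2), ("FIFAWWC", 4)]
masks = visibility_masks(sum(n for _, n in counts))
starts = first_index_per_account(counts)
print(starts)                    # {'FIFAWorldCup': 0, 'FIFAcom': 3, 'FIFAWWC': 5}
print(masks[starts["FIFAcom"]])  # only the fourth trace is visible
```

Deriving the start offsets this way would avoid hard-coding indices such as 23 and 57 in the drop-down buttons, so the menu keeps working if the number of dates per account changes.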
A Game of Inequality: Highlighting Gender Disparity in FIFA
Even today, inequality between men’s and women’s football persists, especially in funding, coverage, salary, and infrastructure. With the visualizations in this section, we hope not only to explore but also to emphasize this disparity.
Financial Spending: How Men’s and Women’s FIFA Differ
Qatar made headlines with the exorbitant amount it spent on the latest World Cup. However, it is common for countries to splurge when selected to host the Men’s FIFA World Cup. With the influx of tourism and other factors, hosting the World Cup has its economic advantages. The plot below attempts to map the relationship between the amount a country spent and the total, average, and highest number of people who attended the World Cup.
Code
import pandas as pd
import altair as alt
import warnings

warnings.simplefilter('ignore')

df = pd.read_csv('../data/finan_data_long.csv')
df2 = pd.read_csv('../data/finan_data.csv', index_col=0)
df2['country_year'] = df2['country'] + ',' + df2['year'].astype(str)
df['country_year'] = df['country'] + ',' + df['year'].astype(str)
df3 = df2[df2['country'] != 'Qatar']

selection = alt.selection_single(fields=['country_year'], name='Random')

# sort order and color selection for the bar plots
list_country = ['Qatar,2022', 'France,2019', 'Russia,2018', 'Canada,2015', 'Brazil,2014',
                'Germany ,2011', 'South Africa,2010', 'China,2007', 'Germany ,2006', 'USA,2003',
                'S. Korea/Japan,2002', 'USA,1999', 'France,1998', 'USA,1996']
list_country.reverse()
color = alt.condition(selection, alt.value('#009643'), alt.value('lightgray'))

# bar plot for the amount spent (including Qatar)
bar1 = (
    alt.Chart(df2).mark_bar(size=20)
    .encode(
        y=alt.Y('amount:Q'),
        x=alt.X('country_year:N', sort=list_country),
        color=color,
    )
    .add_selection(selection)
).properties(
    width=400, height=565,
    title={'text': ["Spending By Each Country"], "fontSize": 18, 'subtitle': ["", ""]},
)
# bar customization
bar1.encoding.x.title = 'Year and Country FIFA Was Held'
bar1.encoding.y.title = 'Money Spent (Billions USD)'

# the second bar plot, not including Qatar
bar2 = (
    alt.Chart(df3).mark_bar(size=20)
    .encode(
        y=alt.Y('amount:Q'),
        x=alt.X('country_year:N', sort=list_country),
        color=color,
    )
    .add_selection(selection)
).properties(
    width=400, height=565,
    title={"text": ["Spending By Each Country"], "fontSize": 18,
           "subtitle": ["(Not Including Qatar)", "*units different than previous plot"]},
)
# bar customization
bar2.encoding.x.title = 'Year and Country FIFA Was Held'
bar2.encoding.y.title = 'Money Spent (Billions)*'

# color selection 2 for the bubble plot
color2 = alt.condition(
    selection,
    alt.Color('Gender:N', scale=alt.Scale(range=['#0099FF', '#FD5109'])),
    alt.value('lightgray'),
)

# creating the selection box
category_select = alt.selection_single(
    fields=['type'],
    bind=alt.binding_select(options=df['type'].unique()),
)

# creating the circles
plot = alt.Chart(df).mark_circle().encode(
    x=alt.X('x:Q', axis=None),
    y=alt.Y('y:Q', axis=None),
    size=alt.Size('area:Q', scale=alt.Scale(domain=[0, 0.8], range=[0, 130000]), legend=None),
    color=color2,
    tooltip=['country:N', 'year:N', 'attendance:Q'],
).transform_filter(
    category_select
).properties(
    width=750, height=565,
    title={"text": ["", "Attendance for Each Country", ""], "fontSize": 16},
)

# creating the text so the bubbles are labeled
text = alt.Chart(df).mark_text(
    align='center', baseline='middle', color='#F1F0DA', fontSize=14
).encode(
    x=alt.X('x:Q', axis=None),
    y=alt.Y('y:Q', axis=None),
    text='country:N',
).transform_filter(category_select)

source = alt.Chart().mark_text(
    align='center', baseline='middle', fontSize=14
).encode().properties(
    title={'text': ["", "", "", "", "Data Source: Wikipedia [4] and FIFA Financial Reports [6]"],
           'fontWeight': 'normal', 'fontSize': 12}
)

bar_comb = alt.concat(bar1, bar2)

# create chart1 without configuration settings so the color selection stays linked
# chart1 is the combined plot and text
chart1 = alt.layer(plot, text)

# the final chart; all the configurations are added here
chart = alt.VConcatChart(
    vconcat=[bar_comb, chart1, source],
    title=alt.TitleParams(text=['Country Spending Compared to Game Attendance', ""],
                          anchor='middle', fontSize=24),
    background='#F1F0DA',
    center=True,
    spacing=10,
    padding={"left": 40, "top": 30, "right": 40, "bottom": 60},
    config={
        'view': {'stroke': '#F1F0DA', 'fill': '#F1F0DA'},
        'axis': {'grid': False, 'labelFontSize': 14, 'titleFontSize': 15, 'labelAngle': -45},
        'legend': {'titleFontSize': 16, 'labelFontSize': 12, 'orient': 'bottom'},
    },
)

# adding the selection to the final plot so the bubbles and text change with it
chart.add_selection(category_select)
The ‘total’ option reflects the summed attendance across all stadium events, ‘average’ shows the attendance per event, and ‘highest’ is the largest crowd at a single stadium during the tournament. Men’s FIFA is clearly higher in total attendance, but when looking at the average and highest, Men’s and Women’s FIFA are closer. However, when comparing the amount spent with the average and highest attendance, there is a massive discrepancy between the genders. One might attribute this to each country wanting to spend only a set amount regardless of gender, yet France spent 2 billion preparing for the Men’s FIFA in 1998, while for the Women’s World Cup in 2019 it spent less than one-fourth of that. The goal of this visualization is to let the viewer see the gender disparity between the two World Cups. To read the attendance accurately, hover over the circles to see the number of people.
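The three attendance options can be expressed as simple aggregates over per-match attendance. The sketch below assumes a plain list of per-match figures for one tournament. The function name `attendance_summary` and the numbers are hypothetical and are not taken from the data used in the plot.

```python
def attendance_summary(match_attendance):
    """Summarize per-match attendance for one tournament."""
    total = sum(match_attendance)
    return {
        "total": total,                             # sum over all matches
        "average": total / len(match_attendance),   # attendance per match
        "highest": max(match_attendance),           # largest single crowd
    }


# Made-up attendance figures for four matches (illustration only).
example = [40000, 25000, 55000, 30000]
summary = attendance_summary(example)
print(summary)  # {'total': 150000, 'average': 37500.0, 'highest': 55000}
```

Because a men’s tournament typically has more matches, the total can be much larger even when the average and highest values are close, which is why the three views tell different stories.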
The Pay Gap: Comparing Men’s and Women’s Player Salaries
One of the first areas that comes to mind when people think about gender disparity, not only in football but in most other sports as well, is the pay gap. In the visualization below, we hope to bring to light how drastic the difference in salary is between the best men’s and women’s football players in just the last year. We have also included the table used for the plot below.
Code
# Read in packages
import pandas as pd
import altair as alt
import numpy as np
import warnings

warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# Get the data
salariestable = pd.read_excel("../data/salary_comparison.xlsx")
salariestable.columns = [x.title() for x in list(salariestable.columns)]

# Convert all columns to strings (used for text later)
salariestable = salariestable.astype(str)
salariestable.index += 1

# Change column names
from IPython.display import Markdown
from tabulate import tabulate

Markdown(tabulate(
    salariestable,
    headers=["Player Name", "Salary", "Year", "Gender", "Country"]
))
Table 1: Salaries of the Top Nine Male and Female Football Players 2022-2023

| | Player Name | Salary (USD) | Year | Gender | Country |
|---|---|---|---|---|---|
| 1 | Cristiano Ronaldo | 200000000 | 2022 | Male | Portugal |
| 2 | Kylian Mbappé | 110000000 | 2022 | Male | France |
| 3 | Lionel Messi | 65000000 | 2022 | Male | Argentina |
| 4 | Neymar | 55000000 | 2022 | Male | Brazil |
| 5 | Mohamed Salah | 35000000 | 2022 | Male | Egypt |
| 6 | Erling Haaland | 35000000 | 2022 | Male | Norway |
| 7 | Robert Lewandowski | 27000000 | 2022 | Male | Poland |
| 8 | Eden Hazard | 27000000 | 2022 | Male | Belgium |
| 9 | Andres Iniesta | 25000000 | 2022 | Male | Spain |
| 10 | Sam Kerr | 513000 | 2023 | Female | Australia |
| 11 | Alex Morgan | 450000 | 2023 | Female | United States |
| 12 | Megan Rapinoe | 447000 | 2023 | Female | United States |
| 13 | Julie Ertz | 430000 | 2023 | Female | United States |
| 14 | Ada Hegerberg | 425000 | 2023 | Female | Norway |
| 15 | Marta Vieira | 400000 | 2023 | Female | Brazil |
| 16 | Amandine Henry | 394000 | 2023 | Female | France |
| 17 | Wendie Renard | 392000 | 2023 | Female | France |
| 18 | Christine Sinclair | 380000 | 2023 | Female | Canada |
Code
# Load packages
import pandas as pd
import altair as alt
import numpy as np
import warnings

warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# Get the data
salaries = pd.read_excel("../data/salary_comparison.xlsx")
salaries.columns = [x.title() for x in list(salaries.columns)]
female_salaries = salaries[salaries["Gender"] == "Female"]
male_salaries = salaries[salaries["Gender"] == "Male"]

# Prepare the data
female_salaries.reset_index(drop=True, inplace=True)
male_salaries.Salary = male_salaries.Salary.astype(int)
female_salaries.Salary = female_salaries.Salary.astype(int)

# Make our custom color scheme
color_scheme = ["#009643", "#CB4349"]

# Add selection fields
selection = alt.selection_single(fields=["Player"])

# Combined chart: men and women on the same scale
both = alt.Chart(salaries).mark_bar().encode(
    x=alt.X("Salary:Q", title="Salary (USD)", sort="descending",
            scale=alt.Scale(domain=(200500000, 0))),
    y=alt.Y("Player:N", axis=alt.Axis(title="Player Name", orient="left"),
            sort=alt.Sort(field="Salary", order="descending")),
    color=alt.Color("Salary:Q", scale=alt.Scale(range=color_scheme)),
).properties(
    title=["Pay Gap Between the Highest Paid Men's and Women's Football Players"],
    width=900, height=225,
)
text1 = both.mark_text(align="left", baseline="middle", dx=1).encode(text="Country:N")
both_final = (both + text1)

# Add caption for when salaries were collected
both_final = alt.concat(both_final).properties(title=alt.TitleParams(
    ["**Note: men's salaries are from December 2022 and women's salaries are from March 2023**", " "],
    baseline="top", orient="top", anchor="end", fontWeight="normal", fontSize=11,
))

# Men's chart
men_chart = alt.Chart(male_salaries).mark_bar().encode(
    x=alt.X("Salary:Q", title="Salary (USD)", sort="descending",
            scale=alt.Scale(domain=(0, 230000000))),
    y=alt.Y("Player:N", axis=alt.Axis(title="Player"),
            sort=alt.Sort(field="Salary", order="descending")),
    color=alt.Color("Salary:Q", scale=alt.Scale(range=color_scheme)),
).properties(
    title=["Highest Paid Male Football Players"],
    width=390, height=225,
)
textm = men_chart.mark_text(align="right", baseline="middle", dx=-3).encode(text="Country:N")
men_chart_final = (men_chart + textm)

# Women's chart
women_chart = alt.Chart(female_salaries).mark_bar().encode(
    x=alt.X("Salary:Q", title="Salary (USD)", sort="descending",
            scale=alt.Scale(domain=(600000, 0))),
    y=alt.Y("Player:N", axis=alt.Axis(title="Player Name", orient="right"),
            sort=alt.Sort(field="Salary", order="descending")),
    color=alt.Color("Salary:Q", scale=alt.Scale(range=color_scheme)),
).properties(
    title=["Highest Paid Female Football Players (ZOOMED IN)"],
    width=390, height=225,
)
textw = women_chart.mark_text(align="left", baseline="middle", dx=3).encode(text="Country:N")
women_chart_final = (women_chart + textw)

# Add caption with data source
women_chart_final = alt.concat(women_chart_final).properties(title=alt.TitleParams(
    [" ", " ", "Data Source: Statista [9] and AS USA [10]"],
    baseline="bottom", orient="bottom", anchor="end", fontWeight="normal", fontSize=11,
))

bottom = alt.hconcat(men_chart_final, women_chart_final, spacing=0)
final = alt.vconcat(both_final, bottom)
final.configure(background="#F1F0DA").configure_title(fontSize=15)
From the two plots on top, it is clear that women’s player salaries are drastically lower than the men’s. When plotted on the same scale, we cannot even see any bars on the women’s side because their salaries are so low in comparison. To put it into perspective, based on the figures in Table 1, the highest paid female player, Sam Kerr, made roughly 49 times less this last year than the 9th highest paid male player, Andres Iniesta. While this pay gap has received a lot of attention in recent years, there is, unfortunately, much more that needs to be done to bridge this inequality. Another important point is how difficult it was to find recent data on female players’ salaries. It is crucial to give more attention to this disparity in order to facilitate meaningful change.
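As a quick check of the scale of the gap, the ratios can be computed directly from the salary figures listed in Table 1. This is only a small arithmetic sketch; the helper name `pay_ratio` is introduced here for illustration.

```python
# Salaries in USD, copied from Table 1.
ronaldo_salary = 200_000_000  # highest paid male player
iniesta_salary = 25_000_000   # 9th highest paid male player
kerr_salary = 513_000         # highest paid female player


def pay_ratio(higher, lower):
    """Return how many times larger the higher salary is."""
    return higher / lower


print(round(pay_ratio(iniesta_salary, kerr_salary), 1))  # 48.7
print(round(pay_ratio(ronaldo_salary, kerr_salary), 1))  # 389.9
```

With these values, the 9th highest paid man earns just under 49 times what the highest paid woman earns, and the highest paid man earns nearly 390 times as much.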
Infrastructure: Exploring Match Locations for Women’s Tournaments
Among the many tournaments leading up to the FIFA World Cup, there is a particular trend in location selection. Three main tournaments precede the FIFA World Cup, which is considered the main event for women’s football. These three tournaments are:
FA Women’s Super League
National Women’s Soccer League (NWSL)
UEFA Women’s Euro
First, we want to see which countries and specific stadiums are preferred for which competitions. This will act as the basis for the next step, which is identifying which countries have the largest number of players with the highest rankings.
For this visualization, we look at the period between 2018 and 2022. The FA Women’s Super League has the most data, with stadium locations for 2018, 2019, 2020, and 2021. From the name alone, we can expect the National Women’s Soccer League to favor stadiums in the United States. The FIFA World Cup location, in contrast, changes the most, as it depends on the host country. For example, in 2019, the FIFA World Cup was hosted in France, with its main stadium being the Parc Olympique Lyonnais, also called the Groupama Stadium, in Décines-Charpieu, France.
Code
# IMPORTING LIBRARIES
import plotly.graph_objects as go
import plotly.io as pio
import numpy as np
import pandas as pd

pio.renderers.default = "plotly_mimetype+notebook_connected"

# Importing data
df = pd.read_csv("../data/clean_matches.csv")

# CHANGE DATA TYPE OF YEAR TO STRING SO IT CAN BE COMPARED WITH THE STRING LABELS BELOW
df["year"] = df["year"].astype("str")

# Create subsets of data to create traces
wsl = df[df['competition_name'] == 'FA Womens Super League']
df1_2018 = wsl[wsl['year'] == '2018']
df1_2019 = wsl[wsl['year'] == '2019']
df1_2020 = wsl[wsl['year'] == '2020']
df1_2021 = wsl[wsl['year'] == '2021']
df2 = df[df['competition_name'] == 'NWSL']
df3 = df[df['competition_name'] == 'UEFA Womens Euro']
df4 = df[df['competition_name'] == 'Womens World Cup']

# Helper that builds one map trace per subset.
# Marker size uses the subset's own 'freq' column (the flattened original passed
# the full df['freq'], which does not line up with the plotted stadiums).
def make_trace(data, color, name):
    return go.Scattergeo(
        lat=data['lat'],
        lon=data['long'],
        text=data['text'],
        mode='markers',
        marker=dict(size=data['freq'], color=color, opacity=0.8, symbol='circle'),
        name=name,
        visible=True,
    )

# COMBINING TRACES (TRACE-1 to TRACE-7)
traces = [
    make_trace(df1_2018, '#FCC92B', "FA Women's Super League 2018"),
    make_trace(df1_2019, '#009643', "FA Women's Super League 2019"),
    make_trace(df1_2020, '#009AFE', "FA Women's Super League 2020"),
    make_trace(df1_2021, '#FF818D', "FA Women's Super League 2021"),
    make_trace(df2, '#FF5108', "National Women's Soccer League 2018"),
    make_trace(df3, '#8538B1', "UEFA Womens Euro 2022"),
    make_trace(df4, '#960505', 'Womens World Cup 2019'),
]

# INITIALIZE GRAPH OBJECT
fig = go.Figure(data=traces)

# VARIABLES FOR BUTTON LOCATION
button_height = 0.15
x1_loc = 0.00
y1_loc = 1

# ADD ANNOTATION
fig.add_annotation(
    x=0, y=0,
    text='Data Source: StatsBomb [3], Wikipedia [4]',
    showarrow=False,
    visible=True,
)

# DROPDOWN MENUS
fig.update_layout(
    title='Location of Stadiums by Tournament and Number of Matches Played',
    plot_bgcolor='#F1F0DA',
    paper_bgcolor='#F1F0DA',
    geo=dict(
        lonaxis=dict(range=[-130, 20]),
        lataxis=dict(range=[20, 65]),
        showocean=True,
        oceancolor='#F1F0DA',
    ),
    updatemenus=[dict(
        buttons=[
            dict(label="FA Women's Super League", method="update",
                 args=[{"visible": [True, True, True, True, False, False, False]}]),
            dict(label="National Women's Soccer League 2018", method="update",
                 args=[{"visible": [False, False, False, False, True, False, False]}]),
            dict(label="UEFA Womens Euro 2022", method="update",
                 args=[{"visible": [False, False, False, False, False, True, False]}]),
            dict(label="Womens World Cup 2019", method="update",
                 args=[{"visible": [False, False, False, False, False, False, True]}]),
        ],
        direction="down",
        showactive=True,
        pad={"r": 10, "t": 10},
        x=x1_loc, y=y1_loc,
        xanchor="left", yanchor="top",
    )],
    width=900,
)

# SHOW FIGURE
fig.show()
Overall, the stadium distribution for all these tournaments is concentrated in Europe. The one obvious exception, as mentioned earlier, is the National Women’s Soccer League, which has matches strictly in the United States. Most of its stadiums are located on the East Coast, which also has the highest number of matches played. However, a few stadiums are also on the West Coast, and very few are in the Midwest and South.
For the FA Women’s Super League, we can see that different stadiums were selected in different years; however, all of them are located in the United Kingdom, specifically England. The marker size represents the number of matches played at the stadium, but for each tournament separately. If we were to compare the FA Women’s Super League with the National Women’s Soccer League, the marker sizes would not be comparable, since the highest number of matches played at any National Women’s Soccer League stadium is 5, while for the FA Women’s Super League it is over 30. Therefore, we added hover text to each trace, allowing the user to hover over any stadium and see how many matches were played there. For the FA Women’s Super League, we can see the four major football cities with major football clubs in England: London, Manchester, Liverpool, and Birmingham, where Manchester and Liverpool are close enough to overlap on the map. We can also see some stadiums west of London. We would expect these to be in other major cities in England, such as Southampton, Bath, or Bournemouth. Once we hover over these stadiums, we can see that many are in smaller towns that might not be as well known, which is also true for some of the UEFA Women’s Euro stadiums.
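One way to think about marker sizes that are relative within each tournament is to scale every match count by that tournament’s own maximum. The sketch below illustrates this idea only; it is not how the map above sets its sizes, and the function name `scale_marker_sizes`, the `max_size` value, and the match counts are hypothetical.

```python
def scale_marker_sizes(match_counts, max_size=30.0):
    """Scale match counts so the busiest stadium gets max_size."""
    peak = max(match_counts)
    return [max_size * count / peak for count in match_counts]


# Hypothetical match counts per stadium (not the real data).
nwsl = [5, 3, 1]
wsl = [30, 15, 6]
print(scale_marker_sizes(nwsl))  # [30.0, 18.0, 6.0]
print(scale_marker_sizes(wsl))   # [30.0, 15.0, 6.0]
```

After scaling, a stadium with 5 matches in one tournament and a stadium with 30 matches in another both get the largest marker, which is exactly why the sizes cannot be compared across tournaments and the hover text is needed.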
As with the FA Women’s Super League, the stadium distribution for the UEFA Women’s Euro is highly concentrated in England. We can again see stadiums in the London and Manchester/Liverpool areas; however, once we hover over the individual stadiums, they are usually located in a smaller city near one of the major cities.
Finally, for the FIFA Women’s World Cup, the host country changes with every edition, and in 2019, as mentioned before, France hosted the tournament. We can see a fairly even geographic distribution across all of France. From the map, we can identify the stadiums around the major French cities, such as Paris in the north of France, Lyon in the central south, and Marseille on the south coastline. West of Marseille, we can also see Montpellier, which has one of France’s bigger stadiums. One interesting fact about the Women’s World Cup is that, according to our research, the Groupama Stadium in Décines-Charpieu was supposed to be the main stadium for this tournament; however, only three matches were played there. In comparison, seven matches were played at the Parc des Princes stadium in Paris. Hovering over the stadiums, we see once again that many of the locations are near major cities, but only a few are actually in the city.
To conclude, when it comes to the location of the stadiums, there is a trend: many of the stadiums are located close to large cities, but they are not the large stadiums that would likely host the equivalent men’s tournaments. Women’s professional sports have been known to be underfunded compared with men’s, which may be one of the reasons for choosing smaller and less prestigious stadiums. However, this hypothesis requires further analysis.
The Rise of Women’s FIFA: Celebrating the Players and their Skill
Now that we have looked more generally into the FIFA World Cup and highlighted the persistent inequality between men’s and women’s football, we would like to focus solely on these women and their teams leading up to the 2023 Women’s World Cup. These visualizations will showcase the performance of the teams and players during past World Cup games, while specifically focusing on the key players set to participate in the upcoming tournament.
Geographical Dynamics: Key Players for the 2023 Women’s World Cup
As previously discussed, the Women’s World Cup is a highly anticipated, large-scale event this year. Many players are currently at the top of their game, and we would like to showcase them. In the summer of 2022, ESPN came out with a list of the top 50 players to watch in this next World Cup [2]. The table below reproduces the list from the article in the order that ESPN ranked the players. The choropleth globe below is based on this list and gives insight into which countries these key players will be representing this July.
Code
import warnings

warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# Read in packages
import pandas as pd
import numpy as np
import plotly
import requests
import plotly.graph_objects as go
from plotly.offline import plot
import io

# Read in data
top50table = pd.read_csv("../data/top50_women_espn.csv")

# Convert all columns to strings (used for text later)
top50table = top50table.astype(str)
top50table.index += 1

# Change column names
from IPython.display import Markdown
from tabulate import tabulate

Markdown(tabulate(
    top50table,
    headers=["Name", "Country", "Club", "Age", "Position", "Rank"]
))
Table 2: ESPN’s List of the Top 50 Upcoming Players
| | Name | Country | Club | Age | Position | Rank |
|---|---|---|---|---|---|---|
| 1 | Alexia Putellas | Spain | Barcelona | 28 | Midfielder | 22 |
| 2 | Sam Kerr | Australia | Chelsea | 28 | Forward | 2 |
| 3 | Vivianne Miedema | Netherlands | Arsenal | 25 | Forward | 3 |
| 4 | Caroline Graham Hansen | Norway | Barcelona | 27 | Midfielder | 9 |
| 5 | Pernille Harder | Denmark | Chelsea | 29 | Forward | 4 |
| 6 | Catarina Macario | United States | Lyon | 22 | Midfielder | Not ranked |
| 7 | Marie-Antoinette Katoto | France | Paris Saint-Germain | 23 | Forward | 19 |
| 8 | Jennifer Hermoso | Spain | Pachuca | 32 | Forward | 17 |
| 9 | Aitana Bonmati | Spain | Barcelona | 24 | Midfielder | Not ranked |
| 10 | Ada Hegerberg | Norway | Lyon | 26 | Forward | Not ranked |
| 11 | Wendie Renard | France | Lyon | 31 | Defender | 11 |
| 12 | Christiane Endler | Chile | Lyon | 30 | Goalkeeper | 30 |
| 13 | Magdalena Eriksson | Sweden | Chelsea | 28 | Defender | 34 |
| 14 | Fran Kirby | England | Chelsea | 29 | Forward | 12 |
| 15 | Lieke Martens | Netherlands | Paris Saint-Germain | 29 | Forward | 28 |
| 16 | Lauren Hemp | England | Manchester City | 21 | Forward | Not ranked |
| 17 | Mapi Leon | Spain | Barcelona | 27 | Defender | Not ranked |
| 18 | Irene Paredes | Spain | Barcelona | 30 | Defender | Not ranked |
| 19 | Rose Lavelle | United States | OL Reign | 27 | Midfielder | 15 |
| 20 | Beth Mead | England | Arsenal | 27 | Forward | Not ranked |
| 21 | Debinha | Brazil | North Carolina Courage | 30 | Forward | 10 |
| 22 | Lindsey Horan | United States | Lyon | 28 | Midfielder | 23 |
| 23 | Stina Blackstenius | Sweden | Arsenal | 26 | Forward | Not ranked |
| 24 | Patri Guijarro | Spain | Barcelona | 24 | Midfielder | Not ranked |
| 25 | Ji So-Yun | South Korea | Suwon FC | 31 | Midfielder | 18 |
| 26 | Kadeisha Buchanan | Canada | Chelsea | 26 | Defender | 33 |
| 27 | Ellie Carpenter | Australia | Lyon | 22 | Defender | Not ranked |
| 28 | Ashley Lawrence | Canada | Paris Saint-Germain | 27 | Defender | Not ranked |
| 29 | Amandine Henry | France | Lyon | 32 | Midfielder | 16 |
| 30 | Kim Little | Scotland | Arsenal | 31 | Midfielder | 39 |
| 31 | Lucy Bronze | England | Barcelona | 30 | Defender | 5 |
| 32 | Fridolina Rolfo | Sweden | Barcelona | 28 | Defender | Not ranked |
| 33 | Trinity Rodman | United States | Washington Spirit | 20 | Forward | Not ranked |
| 34 | Jessie Fleming | Canada | Chelsea | 24 | Midfielder | Not ranked |
| 35 | Kadidiatou Diani | France | Paris Saint-Germain | 27 | Forward | 36 |
| 36 | Alex Morgan | United States | San Diego Wave | 32 | Forward | 38 |
| 37 | Sam Mewis | United States | Kansas City Current | 29 | Midfielder | 1 |
| 38 | Millie Bright | England | Chelsea | 28 | Defender | Not ranked |
| 39 | Sara Dabritz | Germany | Lyon | 27 | Midfielder | Not ranked |
| 40 | Barbara Bonansea | Italy | Juventus | 31 | Forward | Not ranked |
| 41 | Delphine Cascarino | France | Lyon | 25 | Midfielder | 21 |
| 42 | Caroline Weir | Scotland | Free agent | 26 | Midfielder | 25 |
| 43 | Asisat Oshoala | Nigeria | Barcelona | 27 | Forward | 27 |
| 44 | Jess Fishlock | Wales | OL Reign | 35 | Midfielder | Not ranked |
| 45 | Tabea Wassmuth | Germany | Wolfsburg | 25 | Forward | Not ranked |
| 46 | Lea Schuller | Germany | Bayern Munich | 24 | Forward | Not ranked |
| 47 | Leah Williamson | England | Arsenal | 25 | Defender | Not ranked |
| 48 | Caitlin Foord | Australia | Arsenal | 27 | Forward | Not ranked |
| 49 | Christine Sinclair | Canada | Portland Thorns | 39 | Forward | Not ranked |
| 50 | Jill Roord | Netherlands | Wolfsburg | 25 | Midfielder | Not ranked |
Code
# Load packages
import numpy as np
import pandas as pd
import plotly.graph_objects as go

# Read in data
top50 = pd.read_csv("../data/top50_women_espn.csv")

# Convert all columns to strings (used for text later)
top50 = top50.astype(str)
top50.columns = ["name", "country", "club", "age", "position", "rank"]

# Group England, Scotland, and Wales under a single map region
def is_uk(country):
    if country in ["Scotland", "England", "Wales"]:
        return "United Kingdom"
    return country

top50["country2"] = top50["country"].apply(is_uk)

# Get summary stats per country
top50_2 = top50.copy()

# Make age column an integer
top50_2["age"] = top50_2["age"].astype(int)

# Make rank column numeric; "Not ranked" becomes NaN so the mean skips it
def replace_not_ranked(string):
    if string == "Not ranked":
        return np.nan
    return int(string)

top50_2["rank"] = top50_2["rank"].apply(replace_not_ranked)
stats = top50_2.groupby("country2")[["age", "rank"]].mean().reset_index()
stats["age"] = stats["age"].round(1)
stats["rank"] = stats["rank"].round(1).fillna("Not Ranked")
stats.columns = ["country", "mean_age", "mean_rank"]

# Getting the number of players for each country
country_counts = top50["country2"].value_counts().reset_index()
country_counts.columns = ["country", "count"]

# Get country codes (these must match the row order of the merged data frame below)
codes = ["GB", "ES", "US", "FR", "CA", "AU", "NL", "SE", "DE", "NO", "DK", "CL", "BR", "KR", "IT", "NG"]
codes2 = ["GBR", "ESP", "USA", "FRA", "CAN", "AUS", "NLD", "SWE", "DEU", "NOR", "DNK", "CHL", "BRA", "KOR", "ITA", "NGA"]

# Merge with count data frame and stats data frame
merged_counts = pd.merge(country_counts, stats)
merged_counts["code"] = codes
merged_counts["code2"] = codes2
merged_counts = merged_counts.rename(columns={"count": "counts"})

# Adding a text column with player information for each country
merged_counts["text"] = ""

# Loop through each country in the country count data
for i in range(len(merged_counts)):
    # Set the current country
    country = merged_counts.country[i]
    # Add summary stats for that country
    merged_counts.loc[i, "text"] = (
        "SUMMARY STATS: " + country.upper()
        + "<br>Number in the Top 50: " + str(merged_counts.counts[i])
        + "<br>Mean Player Age: " + str(merged_counts.mean_age[i])
        + "<br>Mean Player Ranking: " + str(merged_counts.mean_rank[i])
    )

# Make a custom color map
colors = ["#27297F", "#404E7E", "#405C7E", "#406C7E", "#40787E", "#407E78", "#407E6D", "#407E5A", "#437E40"]

# Create the choropleth trace
choropleth_trace = go.Choropleth(
    locations=merged_counts["code2"],
    z=merged_counts["counts"],
    text=merged_counts["text"],
    colorscale=colors,
    autocolorscale=False,
    reversescale=True,
    marker_line_color="black",
    marker_line_width=0.5,
    colorbar_title="Number of Players <br>in the Top 50",
    hovertemplate="<b>%{text}</b><br>",
    hoverinfo="name",
)

# Create the figure
globe = go.Figure(data=[choropleth_trace])
globe.update_layout(
    plot_bgcolor="#F1F0DA",
    paper_bgcolor="#F1F0DA",
    geo=dict(
        projection_type="orthographic",
        showland=True,
        landcolor="#9FC5AA",
        oceancolor="rgb(152, 190, 217)",
        showcountries=True,
        showlakes=False,
        showocean=True,
        countrycolor="rgb(30, 56, 38)",
        lakecolor="rgb(135, 206, 250)",
        bgcolor="#F1F0DA",
    ),
    title="Origin Countries of the Top 50 Players to <br>Watch Out for in the 2023 FIFA Women's World Cup",
    title_x=0.5,
    width=900,
    height=700,
    annotations=[
        dict(
            x=0.5,
            y=-0.05,
            showarrow=False,
            text='<br>Data Source: "ESPN FC Women\'s Rank: The 50 best footballers in the world today" [2]',
            xref="paper",
            yref="paper",
            font=dict(size=12, color="black"),
        )
    ],
)

# Show the figure
globe.show()
This globe is a choropleth map showing where the key players to watch come from. Many of these top-performing players come from the United Kingdom; the map groups England, Scotland, and Wales into a single region, although each fields its own national team. Spain and the United States are also home to several of these key players. Hovering over a country shows summary statistics for the key players who call that country home. For example, the six American players on ESPN’s list have an average age of around 26.3, while the six Spanish players have an average age of around 27.5. It will be exciting to see how these key players perform in the upcoming World Cup, and the following visualizations offer some insight into what to expect.
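To make the hover statistics concrete, here is a minimal plain-Python sketch of the same grouping logic, using five rows from the table above. It is an illustration rather than the code behind the globe: the helper names `map_country` and `summarize` are hypothetical, and the last field is the column the code calls `rank`, with “Not ranked” entries left out of the mean.

```python
from collections import defaultdict
from statistics import mean

# Five rows from the table above: (name, country, age, rank)
players = [
    ("Fran Kirby", "England", 29, "12"),
    ("Kim Little", "Scotland", 31, "39"),
    ("Jess Fishlock", "Wales", 35, "Not ranked"),
    ("Rose Lavelle", "United States", 27, "15"),
    ("Trinity Rodman", "United States", 20, "Not ranked"),
]

UK_NATIONS = {"England", "Scotland", "Wales"}


def map_country(country):
    """Group the UK home nations under one map region (hypothetical helper)."""
    return "United Kingdom" if country in UK_NATIONS else country


def summarize(rows):
    """Return player count, mean age, and mean rank per map region."""
    ages = defaultdict(list)
    ranks = defaultdict(list)
    for _name, country, age, rank in rows:
        region = map_country(country)
        ages[region].append(age)
        # Players without a rank are skipped when averaging ranks
        if rank != "Not ranked":
            ranks[region].append(int(rank))
    return {
        region: {
            "count": len(ages[region]),
            "mean_age": round(mean(ages[region]), 1),
            "mean_rank": round(mean(ranks[region]), 1) if ranks[region] else "Not Ranked",
        }
        for region in ages
    }


summary = summarize(players)
print(summary["United Kingdom"])
```

With only these five rows, the three UK players are averaged together, and the unranked players affect the count and mean age but not the mean rank.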
Winning Trends: Women’s World Cup Match Victories across Time
Now that we have a glimpse into which countries and players will play a key role in this upcoming World Cup, let us look back at the historical winning trends for nations playing in the FIFA Women’s World Cup. In this visualization, we want to see whether there are trends in match wins for different teams over time. We look at 30 national teams and show which of them ranked in the top 10 by wins at each FIFA Women’s World Cup between 1991 and 2019. The visualization is animated, allowing us to see how the top 10 changes from one tournament to the next.
Code
import warnings

import numpy as np
import pandas as pd
import plotly.io as pio
from raceplotly.plots import barplot

warnings.simplefilter('ignore')
pio.renderers.default = "plotly_mimetype+notebook_connected"

# Read in the data
df_f = pd.read_csv("../data/kaggle_female_matches.csv")

# Convert the date column and add the year
df_f['date'] = pd.to_datetime(df_f['date'])
df_f['year'] = df_f['date'].dt.year

# Keep World Cup matches up to and including 2019
df_f_2019 = df_f[(df_f['year'] < 2020) & (df_f['tournament'] == 'FIFA World Cup')].copy()

# Label each match with its winner, marking draws as 'tie'
df_f_2019['winner'] = np.where(
    df_f_2019['home_score'] == df_f_2019['away_score'],
    'tie',
    np.where(df_f_2019['home_score'] > df_f_2019['away_score'],
             df_f_2019['home_team'], df_f_2019['away_team']),
)
df_f_2019['win_ct'] = 1

# Drop draws so that only wins are counted
df_f_2019 = df_f_2019[df_f_2019['winner'] != 'tie']

# Create pivot table to get all possible combinations of year and winner
pivot_table = pd.pivot_table(df_f_2019, index=['year'], values='win_ct', aggfunc='sum',
                             columns=['winner'], fill_value=0)

# Reshape the pivot table and reset the index to get the desired output format
output_df = pivot_table.stack().reset_index(name='win_ct')

# Create bar race animation
my_raceplot = barplot(output_df, item_column='winner', value_column='win_ct',
                      time_column='year', top_entries=10)

# Add labels and titles
my_raceplot.plot(
    item_label='Team',
    value_label="Women's FIFA World Cup Wins",
    frame_duration=2000,
    title="Top 10 Women's FIFA World Cup Teams by Tournament Wins and Year",
    time_label='Year:',
)

# Modify the plot to add the data source below the chart
my_raceplot.fig.update_layout(
    paper_bgcolor='#F1F0DA',
    plot_bgcolor="#F1F0DA",
    height=600,
    annotations=[
        dict(
            x=-0.08,
            y=-0.45,
            showarrow=False,
            text='Data source: Kaggle [8]',
            xref="paper",
            yref="paper",
            font=dict(size=12, color="black"),
        )
    ],
)
As expected, the results for 1991 and 2019 are incredibly different. This is partly because far fewer teams played in the FIFA Women’s World Cup in 1991 than in 2019. In 1991, the last team shown is North Korea with 0 wins, since there were not enough teams in the 1991 tournament to fill the top 10 with match winners. By 1995, more teams are playing, and the plot no longer shows countries with 0 wins. A specific group of countries stays in the top 10: the United States, Norway, Germany, Sweden, Brazil, and China PR. Their positions on the chart change, with the United States and Norway staying in the top half and the rest moving around. England also starts to make an appearance, as most of the stadiums hosting tournaments played in anticipation of the FIFA Women’s World Cup are in England.
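One possible reading of the zero-win row is that it comes from how the data is prepared: the pivot step in the code above fills every year and team combination with 0, so a team that only won in a later tournament can still appear in an earlier frame. The sketch below reproduces that behavior in plain Python with made-up matches; the team names and scores are hypothetical, and it is not the pandas code used for the chart.

```python
from collections import Counter

# Hypothetical results: (year, home_team, away_team, home_score, away_score)
matches = [
    (1991, "A", "B", 2, 0),
    (1991, "B", "C", 1, 1),  # draw: no winner is counted
    (1995, "C", "A", 3, 1),
    (1995, "A", "B", 0, 1),
]


def winner(home, away, home_score, away_score):
    """Return the winning team, or None for a draw."""
    if home_score == away_score:
        return None
    return home if home_score > away_score else away


# Count wins per (year, team), skipping draws
wins = Counter()
for year, home, away, home_score, away_score in matches:
    team = winner(home, away, home_score, away_score)
    if team is not None:
        wins[(year, team)] += 1

# Fill every year/team combination, using 0 where a team has no win that year
years = sorted({match[0] for match in matches})
teams = sorted({team for _year, team in wins})
table = {(year, team): wins.get((year, team), 0) for year in years for team in teams}

for key in sorted(table):
    print(key, table[key])
```

Here teams B and C show up in 1991 with 0 wins only because they won a match in 1995, which mirrors how a zero-win bar can end up in an early frame of the animation.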
Starting in the 2000s, we see Germany moving up the chart and staying in the top 5 until 2019. Over the rest of the 2000s, several countries, such as Ghana and Japan, appear for only one tournament and then disappear. In 2007, England has only one win, but it moves up the chart and eventually ends up third in 2019. The final 2019 standings resemble what we see today: the United States, Netherlands, England, Sweden, and Germany make up the top 5, followed by France and Italy, with Australia, Brazil, and Norway tied at two wins.
This tool could be used to anticipate future growth: club managers could use it to identify national teams from which they want to recruit players. It could also support a national team’s strategic planning when deciding which players should start in which matches, and it could help betting agencies and people who want to bet on matches.
Player Attributes: Comparing Performance at the 2019 Women’s World Cup
We expect that a closer look at the most recent Women’s World Cup in 2019 will provide insight into what we can expect this July, with most of the key players in ESPN’s FC rank list set to participate. We took a subset of the 2019 World Cup data covering the women who made ESPN’s list. A total of 13 of the 50 players on the list scored during the 2019 World Cup. We use the visualization below to compare their performance metrics when scoring their goals.
Code
# Load packages
import pandas as pd
import altair as alt

# Get the data
stats = pd.read_csv("../data/wwc_2019_match_shots.csv")
top50 = pd.read_csv("../data/top50_women_espn.csv")

# Make our custom color scheme
color_scheme = ['#0099FF', '#009643', '#CB4349', '#FF818C', '#FCC92B', '#FD5109', '#CE6DD3',
                '#FA8F38', '#8538B1', '#4983F8', '#A9DDD6', '#A2F17D', '#0C0582', '#960505']

# Make a new timestamp column in fractional minutes
stats["timestamp_minute"] = (stats["minute"] + stats["second"] / 60).round(2)

# Rename player column
stats = stats.rename(columns={"player.name": "player_name"})

# Standardize Alex Morgan name
stats["player_name"] = stats["player_name"].replace("Alexandra Morgan Carrasco", "Alex Morgan")

# Get the top 50 player names
top50_names = list(top50["Name"])

# Filter match data for only the players in top 50
stats_top50 = stats[stats["player_name"].isin(top50_names)].reset_index(drop=True)

# Click selection on player name, shared by all charts
selection = alt.selection_single(fields=["player_name"], name="Random")

# Color by player when selected, gray otherwise (with and without a legend title)
color_titled = alt.condition(
    selection,
    alt.Color("player_name:N", scale=alt.Scale(range=color_scheme), title="Player Name"),
    alt.value("lightgray"),
)
color_untitled = alt.condition(
    selection,
    alt.Color("player_name:N", scale=alt.Scale(range=color_scheme)),
    alt.value("lightgray"),
)

## UPPER LEFT BAR CHART - number of goals per player
bar1 = (
    alt.Chart(stats_top50)
    .mark_bar()
    .encode(
        y=alt.Y("count()", title="Number of World Cup Goals"),
        x=alt.X("player_name:N", title="Player Name",
                sort=alt.EncodingSortField(field="timestamp_minute", op="count", order="ascending"),
                axis=alt.Axis(labelAngle=45)),
        color=color_titled,
    )
    .add_selection(selection)
    .properties(title={"text": "Number of 2019 World Cup Goals by Player"}, width=450, height=225)
)

## LOWER LEFT BAR CHART - mean timestamp of player's goals
bar2_initial = (
    alt.Chart(stats_top50)
    .mark_bar()
    .encode(
        y=alt.Y("mean(timestamp_minute):Q", title="Mean Time of Goal (min)"),
        x=alt.X("player_name:N", title="Player Name",
                sort=alt.EncodingSortField(field="timestamp_minute", op="mean", order="ascending"),
                axis=alt.Axis(labelAngle=45)),
        color=color_titled,
    )
    .add_selection(selection)
    .properties(title={"text": "Average Time These Players Scored"}, width=450, height=225)
)

# HALFTIME AND OVERTIME HORIZONTAL LINES
halftime_text_h = alt.Chart(pd.DataFrame({"y": [47], "text": ["Halftime"]})).mark_text(color="#505050", align="left").encode(x=alt.value(5), y="y", text="text")
overtime_text_h = alt.Chart(pd.DataFrame({"y": [92], "text": ["Overtime"]})).mark_text(color="#505050", align="left").encode(x=alt.value(5), y="y", text="text")
halftime_h = alt.Chart(pd.DataFrame({"y": [45]})).mark_rule(color="#505050").encode(y="y")
overtime_h = alt.Chart(pd.DataFrame({"y": [90]})).mark_rule(color="#505050").encode(y="y")
bar2 = alt.layer(bar2_initial, halftime_h, overtime_h, halftime_text_h, overtime_text_h)

# Add caption with data source
bar2 = alt.concat(bar2).properties(
    title=alt.TitleParams(
        [" ", " ", "Data Source: ESPN FC Women's Rank 2022 Article [2] and Statsbomb [3]"],
        baseline="bottom",
        orient="bottom",
        anchor="end",
        fontWeight="normal",
        fontSize=11,
    )
)

# HALFTIME AND OVERTIME VERTICAL LINES
halftime_text_v = alt.Chart(pd.DataFrame({"x": [47], "text": ["Halftime"]})).mark_text(color="#505050", align="left").encode(y=alt.value(5), x="x", text="text")
overtime_text_v = alt.Chart(pd.DataFrame({"x": [92], "text": ["Overtime"]})).mark_text(color="#505050", align="left").encode(y=alt.value(5), x="x", text="text")
halftime_v = alt.Chart(pd.DataFrame({"x": [45]})).mark_rule(color="#505050").encode(x="x")
overtime_v = alt.Chart(pd.DataFrame({"x": [90]})).mark_rule(color="#505050").encode(x="x")

# Build one scatter plot of a metric against game minute, with the reference lines layered on top
def make_scatter(y_field, y_title, title, color):
    points = (
        alt.Chart(stats_top50)
        .mark_circle(size=45)
        .encode(
            x=alt.X("timestamp_minute:Q", title="Game Minute"),
            y=alt.Y(y_field, title=y_title),
            color=color,
        )
        .properties(title={"text": title}, width=450, height=167)
    )
    return alt.layer(points, halftime_v, overtime_v, halftime_text_v, overtime_text_v)

## SCATTER PLOTS
scatter1 = make_scatter("TimeInPoss:Q", "Time in Possession (ms)",
                        "Possession Time Before Scoring by Player and Game Minute", color_titled)
scatter2 = make_scatter("avevelocity:Q", "Average Ball Velocity (m/s)",
                        "Average Ball Velocity When Scoring by Player and Game Minute", color_untitled)
scatter3 = make_scatter("DistToGoal:Q", "Distance to Goal (m)",
                        "Distance to Goal When Scoring by Player and Game Minute", color_untitled)

# Combine: bar charts on the left, scatter plots on the right
chart1 = alt.vconcat(bar1, bar2)
chart2 = alt.vconcat(scatter1, scatter2, scatter3)
alt.hconcat(chart1, chart2, spacing=5).configure(background="#F1F0DA").configure_title(fontSize=15)
The first bar plot shows the total number of 2019 World Cup goals scored by each of the players. We can see that the three players with the most World Cup goals in 2019 were Vivianne Miedema, Alex Morgan, and Caroline Graham Hansen. The second bar plot shows the average game minute at which these players scored their goals, so we can see which players tended to score earlier in the game and which tended to score later. Jill Roord, for example, scored her goals towards the end of the games.
The three scatter plots on the right are linked to these bar plots. Clicking a player’s bar on the left highlights that player’s data in the three scatter plots on the right. The scatter plots show each player’s time in possession before their goals, the ball’s average velocity when scoring, and the player’s distance to goal when scoring, each plotted against the game minute across a standard game’s time period (including overtime). To reiterate, this data is based on the goals scored in the 2019 Women’s World Cup.
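As a small worked example of the two bar charts, the sketch below converts each goal’s minute and second into a fractional game minute and then computes the goal count and mean scoring time per player. The players and times are made up for illustration, and the helper name `timestamp_minute` simply mirrors the column of the same name in the code above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical goals: (player, minute, second)
goals = [
    ("Player A", 12, 30),
    ("Player A", 77, 0),
    ("Player B", 88, 45),
]


def timestamp_minute(minute, second):
    """Express a goal time as fractional minutes, rounded to two decimals."""
    return round(minute + second / 60, 2)


# Collect the goal times of each player
times_by_player = defaultdict(list)
for player, minute, second in goals:
    times_by_player[player].append(timestamp_minute(minute, second))

# Goal count (first bar chart) and mean scoring time (second bar chart)
summary = {
    player: {"goals": len(times), "mean_minute": round(mean(times), 2)}
    for player, times in times_by_player.items()
}

for player, values in summary.items():
    print(player, values)
```

A mean scoring time above 45 indicates a player whose goals fell mostly in the second half, which is how the halftime line in the second bar chart can be read.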
Field Stats: Shots & Outcomes at the 2019 Women’s World Cup Final
Below is a modified scatter plot, built with Plotly graph objects (not Plotly Express), in which we converted the coordinate plane into a football field. This view gives the audience a bird’s-eye view of the actual shots taken during the last FIFA Women’s World Cup Final. The plot also highlights how the offensive strategy of the US players contributed to their win, compared with the shots taken by the Netherlands team.
Now that we have viewed some players’ overall game statistics, especially Alex Morgan’s dominant presence, let’s look at field statistics from the 2019 FIFA Women’s World Cup Final between the United States and the Netherlands. The plot below shows a soccer field with the locations of the shots taken by all players. The left side shows shots taken by Netherlands players, while the right side shows shots taken by US players. The outcome of each shot is shown by color. The outcomes are “Blocked”, a shot stopped by a defender; “Goal”, a shot deemed by officials to have crossed the goal line; “Off T”, a shot whose initial trajectory ended outside the posts; and “Saved”, a shot saved by the opposing team’s goalkeeper. By hovering over a shot on the field, you can see further statistics such as the player’s name, the time the ball was in their possession, the minute of the game, the body part used to take the shot, and the distance from the goal.
Code
import pandas as pd
import plotly.graph_objects as go

# Read in the data
shots = pd.read_csv("../data/shots.csv")
net_shots = pd.read_csv("../data/net_shots.csv")

# Subset the data
columns = ['location.x', 'location.y', 'shot.outcome.name', 'player.name',
           'shot.body_part.name', 'TimeInPoss', 'DistToGoal', 'minute']
us_shots = shots[columns]
net_shots = net_shots[columns].copy()

# Subtract the x-coordinates from the maximum x-coordinate to get a mirror image on the field
net_shots['location.x'] = 120 - net_shots['location.x']

# Join the two data frames
data_merge = pd.concat([us_shots, net_shots], axis=0)

# Round time in possession and distance to goal
data_merge['TimeInPoss'] = data_merge['TimeInPoss'].round()
data_merge['DistToGoal'] = data_merge['DistToGoal'].round()

# Palette: #009643 (green) #FF818C (pink) #F1F0DA (cream) #0099FF (blue) #FCC92B (yellow) #FD5109 (orange)
outcome_colors = {'Goal': '#FF818C', 'Off T': '#0099FF', 'Blocked': '#FCC92B', 'Saved': '#FD5109'}
hover = ('<br><br>Player: %{customdata[0]}<br>Time in Possession (ms): %{customdata[1]}'
         '<br>Distance to goal (m): %{customdata[2]}<br>Body part: %{text}'
         '<br>Game minute: %{customdata[3]}')

fig = go.Figure()

# Create one trace per shot outcome
for outcome, color in outcome_colors.items():
    subset = data_merge[data_merge['shot.outcome.name'] == outcome]
    fig.add_trace(go.Scatter(
        x=subset['location.x'],
        y=subset['location.y'],
        name=outcome,
        mode='markers',
        marker_color=color,
        customdata=subset[['player.name', 'TimeInPoss', 'DistToGoal', 'minute']],
        text=subset['shot.body_part.name'],
        hovertemplate=hover,
    ))

# Update traces
fig.update_traces(mode='markers', marker_size=12, marker_line_width=0.5)

# Update layout
title_font = dict(family='Arial', size=18, color='black')
fig.update_layout(
    plot_bgcolor='#009643',
    paper_bgcolor='#F1F0DA',
    height=600,
    width=800,
    xaxis=dict(gridcolor='#009643', range=[0, 120], showticklabels=False),
    yaxis=dict(gridcolor='#009643', range=[0, 80], showticklabels=False),
    annotations=[
        dict(x=0.5, y=1.15, xref='paper', yref='paper', text='Shot Locations and Outcomes',
             showarrow=False, font=title_font),
        dict(x=0.5, y=1.08, xref='paper', yref='paper', text="Women's World Cup 2019 - USA vs Netherlands",
             showarrow=False, font=dict(family='Arial', size=14, color='black')),
        dict(x=0.1, y=-0.10, xref='paper', yref='paper', text='Netherlands Team',
             showarrow=False, font=title_font),
        dict(x=0.9, y=-0.10, xref='paper', yref='paper', text='United States Team',
             showarrow=False, font=title_font),
        dict(x=1.16, y=-0.18, xref='paper', yref='paper', text='Data source: StatsBomb [3]',
             showarrow=False, font=dict(family='Arial', size=10, color='black')),
    ],
)

# Add pitch markings
white = dict(color='white', width=0.6)
fig.add_shape(type='line', x0=60, y0=0, x1=60, y1=80, line=white)            # halfway line
fig.add_shape(type='rect', x0=0, y0=0, x1=120, y1=80, line=dict(color='white', width=0.5))  # pitch outline
fig.add_shape(type='rect', x0=0, y0=0, x1=60, y1=80, line=white)             # left half
fig.add_shape(type='rect', x0=0, y0=18, x1=18, y1=62, line=white)            # left penalty area
fig.add_shape(type='rect', x0=102, y0=18, x1=120, y1=62, line=white)         # right penalty area
fig.add_shape(type='rect', x0=0, y0=30, x1=6, y1=50, line=white)             # left goal area
fig.add_shape(type='rect', x0=114, y0=30, x1=120, y1=50, line=white)         # right goal area
fig.add_shape(type='rect', x0=0, y0=36, x1=0.5, y1=44, line=white)           # left goal
fig.add_shape(type='rect', x0=120, y0=36, x1=119.5, y1=44, line=white)       # right goal
fig.add_shape(type='circle', x0=49, y0=29, x1=71, y1=51, line=white)         # center circle, centered on (60, 40)

fig.show()
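The mirroring step in the code above is what places the two teams on opposite halves of the field: one team’s x-coordinates are subtracted from the pitch length of 120 used for the x-axis. A minimal sketch of that transformation, with made-up shot locations and a hypothetical helper name `mirror_x`:

```python
PITCH_LENGTH = 120  # x-axis range of the plotted field


def mirror_x(x, pitch_length=PITCH_LENGTH):
    """Reflect an x-coordinate across the halfway line."""
    return pitch_length - x


# Hypothetical shot locations (x, y), recorded as if attacking the goal at x = 120
recorded_shots = [(108.0, 36.5), (95.5, 50.0)]

# After mirroring, the same shots sit in front of the goal at x = 0
mirrored_shots = [(mirror_x(x), y) for x, y in recorded_shots]
print(mirrored_shots)
```

Only the x-coordinate changes, so each shot keeps its distance from the goal line it was aimed at.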
Overall, the plot clearly shows that US players took more shots than Netherlands players. No shot by the Netherlands resulted in a goal, while two shots taken by US players did. This is in line with the final score of 2-0 to the United States.
An interesting observation is that most of the Netherlands players who took a shot had very little time with the ball compared to US players. On average, Netherlands players who took a shot were in possession of the ball for only 9 ms, while US players were in possession for 23 ms. Most notably, Alex Morgan, who took a shot that ended up outside the posts, had the longest possession of the ball. Because Alex Morgan is a prominent US soccer player, visualizing her moves on the field is key. In this game, she shot with her left foot every time, and always in the second half. In fact, although none of her shots resulted in a goal, Alex Morgan accounted for over a quarter of all shots taken during the final, a very high share that is indicative of her impressive skills on the field.
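The two comparisons in this paragraph, mean time in possession per team and one player’s share of all shots, can be computed along the following lines. The numbers below are made up and do not reproduce the figures quoted above; the sketch only illustrates the calculation under that assumption.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical shots: (team, player, time_in_possession)
shots = [
    ("United States", "Player A", 30.0),
    ("United States", "Player A", 20.0),
    ("United States", "Player B", 10.0),
    ("Netherlands", "Player C", 8.0),
]

# Mean time in possession before a shot, per team
possession_by_team = defaultdict(list)
for team, _player, time_in_possession in shots:
    possession_by_team[team].append(time_in_possession)
mean_possession = {team: mean(times) for team, times in possession_by_team.items()}

# Share of all shots in the match taken by each player
shot_counts = Counter(player for _team, player, _time in shots)
shot_share = {player: count / len(shots) for player, count in shot_counts.items()}

print(mean_possession)
print(shot_share)
```

Note that the share is taken over all shots in the match, by both teams, which matches how the quarter-of-all-shots figure is phrased.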
With this plot in mind, we are excited to see what the 2023 World Cup final has in store. Which team will ultimately take the crown as the 2023 Women’s FIFA champions?
Conclusion
Summary of Plots
We began this analysis by looking into which terms are most associated with the FIFA World Cup. The main terms, however, appear to be associated more with the men’s tournament. We therefore explored the differences in funding between the men’s and women’s FIFA World Cups relative to attendance. Although total attendance is higher for the men’s World Cup, the men’s and women’s tournaments are much closer in average and highest attendance. Given these similar attendance figures, the massive financial discrepancy between the genders stands out. Furthermore, we wanted to see whether the financial difference was reflected in the tournament over the years. Thus, Figure 3 depicts individual teams’ development and potential for growth. Knowing the rankings over the years, we looked into how the financial discrepancies affect the venues for different tournaments. Across the four main women’s tournaments, many of the football stadiums that host women’s tournaments are located close to large cities. This is likely due to a lack of funds to afford the venue, since we know from Figure 2 that it is not due to low attendance.
Looking back at our first figure, the plot showed Mbappé to be one of the terms most associated with FIFA; according to Forbes, he is the highest-paid football player in the world [1]. Therefore, we decided to look into the key female players heading into this next World Cup. Figure 5 shows an interactive globe for discovering where the key female players come from. Alex Morgan, who is ranked 38th and plays for the United States, has become very well known in recent years due to her talent. We can explore more of her skills in Figure 6 by seeing how many goals she scored in the 2019 World Cup, when she typically scores her goals throughout a game, and more.
Final Thoughts
Despite the Women’s World Cup being the world’s largest women’s sporting tournament, finding concise information and digestible visualizations on it is extremely challenging compared to men’s football. Therefore, this analysis hopes to catalyze an investigation into the world of women’s football and visually explore its essential aspects. All of the visualizations can be helpful tools for team managers, betting agencies, or football fans who want to explore previous trends and predict trends for the 2023 FIFA Women’s World Cup. With this analysis, we explored the strengths and weaknesses of different teams, players, and tournaments, as well as the overall financial discrepancy between the men’s and women’s FIFA tournaments. We hope that your main takeaway from this analysis is how underrated, underfunded, and under-analyzed women’s football is. In the next couple of years, we hope that coverage of and spending on women’s football will increase. Currently, it does seem as if FIFA is trying to increase coverage of the Women’s World Cup. In fact, in 2021, they finally published their first-ever analysis of the landscape of women’s football [7]. With trends showing the growth that women’s football has made in the past 20+ years, one can hope that there will no longer be a discrepancy between men’s and women’s football in the future.
References
[1] Birnbaum, J. (2022, October 7). The World’s Highest-Paid Soccer Players 2022: Kylian Mbappé Claims No. 1 While Erling Haaland Debuts. Forbes. https://www.forbes.com/sites/justinbirnbaum/2022/10/07/the-worlds-highest-paid-soccer-players-2022-kylian-mbapp-claims-no-1-while-erling-haaland-debuts/?sh=360a1ee8629d
[2] ESPN. (2022, June 27). ESPN FC Women’s Rank: The 50 best footballers in the world today. ESPN. https://www.espn.com/soccer/blog-espn-fc-united/story/4685632/espn-fc-womens-rank-the-50-best-footballers-in-the-world-today
[3] StatsBomb. (2023, April 11). StatsBomb data: Event data. StatsBomb. Retrieved April 24, 2023, from https://statsbomb.com/what-we-do/hub/free-data/
[4] Wikipedia. (2023, April 16). Latitude and longitude of cities. Wikipedia. Retrieved April 24, 2023, from https://en.wikipedia.org/wiki/Geographic_coordinate_system
[5] Twitter. (2019). Get Twitter API. Twitter. Retrieved April 24, 2023, from https://api.twitter.com/2/tweets
[6] FIFA. (2022). Finances. Retrieved April 24, 2023, from https://www.fifa.com/about-fifa/organisation/finances
[7] FIFA. (2022, March 22). FIFA publishes first-ever comprehensive analysis of the elite women’s football landscape. FIFA.com. https://www.fifa.com/media-releases/fifa-publishes-first-ever-comprehensive-analysis-of-the-elite-women-s-football-l
[8] Jürisoo, M. (2022, August 1). Women’s international football results. Kaggle. Retrieved April 24, 2023, from https://www.kaggle.com/datasets/martj42/womens-international-football-results
[9] Statista Research Department. (2023, January 18). Highest paid footballers worldwide 2022. Statista. Retrieved May 2, 2023, from https://www.statista.com/statistics/266636/best-paid-soccer-players-in-the-2009-2010-season/
[10] Gorostieta, D. (2023, March 14). Who are the highest-paid women’s soccer players in the world? Diario AS. Retrieved May 2, 2023, from https://en.as.com/soccer/who-are-the-highest-paid-womens-soccer-players-in-the-world-n/
Appendix
Our Color Scheme
We crafted our color palette based on the 2023 FIFA World Cup colors seen at the top of this webpage. Below is the exact color palette used throughout our visualizations. We picked colors from this palette on a case-by-case basis, depending on whether a plot required a discrete or continuous color palette and on how many colors were needed overall for the plot.