A Visual Journey through the Women’s FIFA World Cup

Authors: Amelia Baier, Sonali Dabhi, Mia Mayerhofer, Tereza Martinkova

Institution: Georgetown University

Check out our GitHub!

Introduction

The 2019 Women’s FIFA World Cup in France saw unprecedented interest in the tournament, with viewership, attendance, and digital engagement reaching record heights across the globe. Four years later, the Women’s FIFA World Cup is set to happen this July in New Zealand and Australia. Despite progress being made, significant inequality persists between men’s and women’s football. Through our data gathering process, we recognized how challenging it is to find concise information and digestible visualizations on women’s football compared with men’s football. Therefore, we intend to highlight this disparity through our first set of visualizations. These visualizations will cover a range of topics including popular phrases over time, overall spending differences between World Cups historically, and more.

As we approach the Women’s World Cup, we are especially motivated and inspired to gain a deeper understanding of the characteristics of women’s football. We hope to take you on a visual journey into the past and present of women’s football with our second set of visualizations. This set will delve into winning trends over time, game statistics on the field, player attributes, and more, relating solely to women’s FIFA. By examining these visualizations, we aim to gain important insights into what is in store for the 2023 Women’s World Cup.

The FIFA World Cup: A Uniting Global Event

The FIFA World Cup is typically regarded as one of the largest sporting events in the world and draws extensive media attention. What is the public discussing regarding FIFA and the World Cup?

Popular Phrases: An Insight into the FIFA World Cup

Over the past several years, Twitter has become an international communication hub for topics ranging from international politics to the latest sports news. Its rise has changed how official news sources interact with the public; accounts such as ESPN can update fans on the latest scores immediately. For this reason, we chose Twitter as the source of text data, drawing on three official FIFA accounts to grasp precisely what the FIFA World Cup is: FIFAWorldCup, FIFAWWC (FIFA Women’s World Cup), and FIFAcom. Using the Twitter API, we pulled tweets dated as close to the most recent FIFA event as possible in order to see which keywords and phrases these official accounts were putting out to their fans. Through this data collection, we hoped to answer the overarching question ‘What exactly is FIFA?’
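The word clouds rely on a table of word frequencies per date for each account. The preprocessing that produced those tables is not shown here, so the following is only a minimal sketch of one way such counts could be built. The toy tweets, the stop-word list, and the function name `word_counts_by_date` are all assumptions made for illustration, not the data or code used in this project.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy tweets as (date, text) pairs; these are NOT real tweets
# from the FIFA accounts, only placeholders to show the shape of the data.
tweets = [
    ("2022-12-17", "Final tomorrow in Qatar"),
    ("2022-12-18", "Goal! What a final, what a goal"),
    ("2022-12-18", "World champions"),
]

# Assumed, deliberately tiny stop-word list for the sketch.
STOPWORDS = {"a", "in", "what", "the"}


def word_counts_by_date(rows):
    """Return {date: Counter(word -> count)} using lowercase alphabetic tokens."""
    counts = defaultdict(Counter)
    for date, text in rows:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[date].update(t for t in tokens if t not in STOPWORDS)
    return dict(counts)


counts = word_counts_by_date(tweets)
# Each date's Counter can then be passed to a word-cloud library as frequencies.
print(counts["2022-12-18"].most_common(2))
```

Keeping one counter per date is what makes a date slider possible later: each slider position simply selects the frequencies for one day.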

Code

import pandas as pd 
import json
from wordcloud import WordCloud
import plotly.graph_objs as go
import warnings
warnings.simplefilter('ignore')
#function that will create each trace
#below function was modified from the following link: 
#   https://github.com/PrashantSaikia/Wordcloud-in-Plotly
#changes were made so multiple wordclouds could be made from frequencies, as well as the location of the words remaining static, as well as color. 
def plotly_wordcloud(text, word_dict):
    wc = WordCloud(max_font_size=100).generate_from_frequencies(text) 
    word_list=[]
    freq_list=[]
    for word_info in wc.layout_:
        word_list.append(word_info[0][0])
        freq_list.append(word_info[0][1])

    # get the positions
    x = []
    y = []
    color_list = []
    #look in the dictionary to get the locations and colors 
    for i in word_list:
        position = word_dict[i]['position']
        color_list.append(word_dict[i]['color'])
        x.append(float(position[0]))
        y.append(float(position[1]))
    # get the relative occurrence frequencies
    new_freq_list = [freq*100 for freq in freq_list]
    trace = go.Scatter(x=x, 
                       y=y, 
                       textfont = dict(size=new_freq_list,
                                       color=color_list),
                       textposition =  'middle center',
                       hoverinfo='text',
                       hovertext=['{0}{1}'.format(w, f) for w, f in zip(word_list, freq_list)],
                       mode='text',  
                       text=word_list, 
                       visible=False
                      )
    return trace

vec_df = pd.read_csv("../data/FIFAWorldCupvectorized.csv", index_col=0)
vec_FIFAcom = pd.read_csv("../data/FIFAcomvectorized.csv", index_col=0)
df_vector = pd.read_csv("../data/FIFAWCCvectorized.csv", index_col=0)
# Load the JSON file
with open('../data/dict_1.json') as f:
    dict_1 = json.load(f)
# Load the JSON file
with open('../data/dict_2.json') as f:
    dict_2 = json.load(f)
# Load the JSON file
with open('../data/dict_3.json') as f:
    dict_3 = json.load(f)

#need to generate a trace for each column 
date_1 = list(vec_df.columns)
date_2 = list(vec_FIFAcom.columns)
date_3 = list(df_vector.columns)

dataframes = [vec_df, vec_FIFAcom, df_vector]
wdict_list = [dict_1, dict_2, dict_3]

# initialize the fig 
fig = go.Figure()
for i, dataframe in enumerate(dataframes): 
    #creating the traces for the current twitter account: 
    for k in range(len(dataframe.columns)-1): 
        text = dataframe.iloc[:, k]
        word_dict = wdict_list[i]
        trace = plotly_wordcloud(text, word_dict)
        fig.add_trace(trace)


fig.data[0].visible = True
fig.update_layout(template="plotly_white")
fig.update_layout(xaxis=dict(showline=False, zeroline=False),
                  yaxis=dict(showline=False, zeroline=False))



matrix_false = []
for i in range(len(fig.data)):
    x = [False] * len(fig.data)
    x[i] = True
    matrix_false.append(x)

#creating the dictionaries: 
dic_sliders = {}
step_count = 0
account_dict = [('FIFAWorldCup', len(date_1)-1), ('FIFAcom',len(date_2)-1),('FIFAWWC',len(date_3)-1)]

for val, j in account_dict:
    steps = []
    for k in range(j):
        step = {'method': 'update',
        'args': [{'visible': matrix_false[step_count]}]}
        steps.append(step)
        step_count += 1
    dic_sliders[val] = [dict(active=0, currentvalue={"prefix": "Date: "}, pad={"t": 50}, steps=steps)]


fig.update_layout(sliders = dic_sliders['FIFAWorldCup']
         ,xaxis=dict(
        range=[-1,7],
        tickvals=[],
        ticktext=[]
        ),
    yaxis=dict(
        range=[-2, 8],
        tickvals=[],
        ticktext=[]
        )
)
fig.update_layout(
    updatemenus=[
        dict(
            buttons=list([
                dict(
                    label="FIFAWorldCup",
                    method="update",
                    args=[{"visible": matrix_false[0]}, {"sliders" : dic_sliders['FIFAWorldCup']}] ,
                ),
                dict(
                    label='FIFAcom',
                    method="update",
                    args=[{"visible": matrix_false[len(date_1) - 1]}, {"sliders": dic_sliders['FIFAcom']}],
                ), 
                dict(
                    label= 'FIFAWWC',
                    method="update",
                    args=[{"visible": matrix_false[(len(date_1) - 1) + (len(date_2) - 1)]}, {"sliders": dic_sliders['FIFAWWC']}],
                )
            ]),
            direction="down",
            showactive=True,
            pad={"r": 1, "t": 1},
            x=0,
            xanchor="left",
            y=2,
            yanchor="top"
        ),
    ],
    plot_bgcolor = "#F1F0DA",
    paper_bgcolor = "#F1F0DA",
    title = "Key Words From Official Tweets During the 2022 World Cup",
    title_x = 0.5,
     annotations=[
        dict(
            x = 0,
            y = 0,
            showarrow = False,
            text = 'Data Source: Tweets Pulled From @FIFAWorldCup, @FIFAWWC, @FIFAcom [5]',
            xref = "paper",
            yref = "paper",
            font = dict(
                size = 10,
                color = "black"
            )
        )
    ],
    width=900,
    height=500,

)

fig.show()

Figure 1: Time Series Word Cloud

An interactive word cloud was created using Plotly. The drop-down menu lets the user choose which account’s words to focus on, while the slider steps through that account’s main phrases over the course of the 2022 FIFA World Cup. The user will see many vital terms such as Qatar, medal, world, and goal, as well as other phrases showing that this event is truly an international experience. On specific days, the user can also see which players are considered key, such as Messi and Mbappe.
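The drop-down and slider both work by toggling which traces are visible: every date of every account is its own trace, and each control step shows exactly one of them. The sketch below illustrates that bookkeeping with plain Python structures in the shape Plotly expects for `update` steps. The trace counts and the helper names `one_hot_masks` and `slider_steps` are assumptions for illustration.

```python
def one_hot_masks(n_traces):
    """Build one visibility mask per trace: mask i shows only trace i."""
    return [[i == j for j in range(n_traces)] for i in range(n_traces)]


def slider_steps(masks, start, count):
    """Slider steps for one account whose traces occupy masks[start:start + count]."""
    return [
        {"method": "update", "args": [{"visible": masks[start + k]}]}
        for k in range(count)
    ]


# Hypothetical number of date traces per account (not the real counts).
trace_counts = {"FIFAWorldCup": 3, "FIFAcom": 2}

masks = one_hot_masks(sum(trace_counts.values()))

# Walk through the accounts in the order their traces were added to the figure,
# tracking the offset of each account's first trace.
sliders = {}
first_trace = {}
offset = 0
for account, count in trace_counts.items():
    first_trace[account] = offset
    sliders[account] = slider_steps(masks, offset, count)
    offset += count

print(first_trace)
```

Tracking the offset of each account’s first trace in a dictionary avoids hard-coding trace indices in the drop-down buttons, which would silently break if the number of dates per account changed.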

A Game of Inequality: Highlighting Gender Disparity in FIFA

Even today, the inequality between men’s and women’s football persists, especially in the areas of funding, coverage, salary, and infrastructure. With the visualizations in this section, we hope not only to explore but also to emphasize this disparity.

Financial Spending: How Men’s and Women’s FIFA Differ

Qatar made headlines for the exorbitant amount it spent on the latest World Cup. However, it is common for countries to splurge when selected to host the Men’s FIFA World Cup. With the influx of tourism and other factors, hosting the World Cup has its economic advantages. The plot below maps the relationship between the amount a host country spent and the total, average, and highest attendance at that World Cup.

Code

import pandas as pd
import altair as alt
import warnings
warnings.simplefilter('ignore')
df = pd.read_csv('../data/finan_data_long.csv')
df2 = pd.read_csv('../data/finan_data.csv', index_col=0)
df2['country_year'] = df2['country'] + ',' + df2['year'].astype(str)
df['country_year'] = df['country'] + ',' + df['year'].astype(str)
df3 = df2[df2['country'] != 'Qatar']
selection = alt.selection_single(fields=['country_year'],name='Random')
#color selection for bar plot
list_country = ['Qatar,2022','France,2019','Russia,2018','Canada,2015','Brazil,2014','Germany ,2011','South Africa,2010','China,2007', 'Germany ,2006','USA,2003','S. Korea/Japan,2002', 'USA,1999', 'France,1998','USA,1996']
list_country.reverse()
color = alt.condition(selection,
                      alt.value('#009643'),
                      alt.value('lightgray'), 
                      )
#bar plot for the amount spent (including Qatar)
bar1 = (alt.Chart(df2).mark_bar(size = 20)
       .encode(
        y=alt.Y('amount:Q'),
        x=alt.X('country_year:N', sort = list_country),
        color = color
        ).add_selection(selection)).properties(width=400,height=565, title = {
                                    'text' : ["Spending By Each Country"],
                                    "fontSize": 18,
                                    'subtitle': ["",""], })
#bar customization 
bar1.encoding.x.title = 'Year and Country FIFA Was Held'
bar1.encoding.y.title = 'Money Spent (Billions USD)'

#the second bar plot not including Qatar: 
bar2 = (alt.Chart(df3).mark_bar(size = 20)
       .encode(
        y=alt.Y('amount:Q'),
        x=alt.X('country_year:N', sort = list_country),
        color = color
        ).add_selection(selection)).properties(width=400,height=565, title = {
      "text": ["Spending By Each Country"] ,
      "fontSize": 18,
      "subtitle": ["(Not Including Qatar)","*units different than previous plot"], 
    })

#bar customization 
bar2.encoding.x.title = 'Year and Country FIFA Was Held'
bar2.encoding.y.title = 'Money Spent (Billions)*'

#color selection 2 for the bubble plot
color2 = alt.condition(selection,
                      alt.Color('Gender:N', scale = alt.Scale(range = ['#0099FF', '#FD5109'])),
                      alt.value('lightgray'))

#creating the selection box
category_select = alt.selection_single(fields=['type'], bind=alt.binding_select(options=df['type'].unique()))

#creating the circles: 
plot = alt.Chart(df).mark_circle().encode(
    x=alt.X('x:Q', axis = None),
    y=alt.Y('y:Q', axis = None),
    size=alt.Size('area:Q', scale=alt.Scale(domain=[0, 0.8], range=[0, 130000]), legend=None),
    color=color2,
    tooltip=['country:N', 'year:N', 'attendance:Q'], 
).transform_filter(
     category_select
).properties(
    width=750,
    height=565, 
    title = {
    "text": ["","Attendance for Each Country", ""],
    "fontSize": 16
    }
)

#creating the text so the bubbles are labeled
text = alt.Chart(df).mark_text(align='center', baseline='middle', color = '#F1F0DA', fontSize= 14).encode(
    x=alt.X('x:Q', axis = None),
    y=alt.Y('y:Q', axis = None),
    text='country:N'
).transform_filter(
   category_select
)

source = alt.Chart().mark_text(align='center', baseline='middle', fontSize= 14).encode().properties(
    title = {
    'text': ["", "", "", "", "Data Source: Wikipedia [4] and FIFA Financial Reports [6]"],
    'fontWeight': 'normal',
    'fontSize': 12}
)

bar_comb = alt.concat(bar1, bar2)

# create chart1 without configuration settings so the color selection stays linked across charts
# chart1 is the combined bubble plot and text labels
chart1 = alt.layer(plot, text)

# the final chart; all configuration settings are added here
chart = alt.VConcatChart(vconcat=[bar_comb, chart1, source], 
                         title=alt.TitleParams(text=['Country Spending Compared to Game Attendance', ""], anchor='middle', fontSize=24),
                         background = '#F1F0DA',
                         center = True,
                         spacing = 10,
                         padding = {"left": 40, "top": 30, "right": 40, "bottom": 60},
                         config={
                             'view': {
                                 'stroke': '#F1F0DA',
                                 'fill': '#F1F0DA'
                             },
                             'axis': {
                                 'grid': False, 
                                 'labelFontSize': 14,
                                 'titleFontSize': 15,
                                 'labelAngle' : -45

                             },
                             'legend': {
                                'titleFontSize' :16,
                                'labelFontSize': 12,
                                'orient': 'bottom'
                             }
                         })
#adding the selection so that way the blob/text will be changed, we add it to the final plot
chart.add_selection(category_select)

Figure 2: A Linked View on FIFA Financial Spending Based on Attendance

The ‘total’ option reflects the summed attendance across all stadium events, ‘average’ shows the attendance per event, and ‘highest’ is the largest crowd at a single stadium during the tournament. Although total attendance is higher for Men’s FIFA, the average and highest attendance for Men’s and Women’s FIFA are much closer. However, when comparing the amount spent against the average and highest attendance, there is a massive discrepancy between the genders. One might attribute this to each country choosing to spend only a set amount regardless of gender, but France spent 2 billion preparing for the Men’s World Cup in 1998, while for the Women’s World Cup in 2019 it spent less than one-fourth of that. The goal of this visualization is to let the viewer see the gender disparity between the two World Cups. To read the attendance accurately, hover over the circles to see the number of people.
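To make the three attendance options concrete, here is a minimal sketch of how they relate to per-match attendance. The match figures are made-up placeholders and `attendance_summary` is a hypothetical helper; the plotted values come from the sources cited in the figure, not from this calculation.

```python
def attendance_summary(per_match):
    """Summarize a tournament's attendance from a list of per-match crowds."""
    total = sum(per_match)
    return {
        "total": total,                     # sum over all matches
        "average": total / len(per_match),  # people per match
        "highest": max(per_match),          # largest single crowd
    }


# Hypothetical per-match attendance for a three-match tournament.
matches = [40_000, 55_000, 25_000]
summary = attendance_summary(matches)
print(summary)
```

Because a tournament with more matches accumulates a larger total almost automatically, the average and highest values are the fairer basis for comparing tournaments of different sizes.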

The Pay Gap: Comparing Men’s and Women’s Player Salaries

The pay gap is one of the first areas that comes to mind when people think about gender disparity, not only in football but in most other sports as well. In the visualization below, we hope to bring to light how drastic the difference in salary is between the best men’s and women’s football players in just the last year. We have also included the table used for the plot below.

Code

# Read in packages
import pandas as pd
import altair as alt
import numpy as np
import warnings
warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# Get the data
salariestable = pd.read_excel("../data/salary_comparison.xlsx")
salariestable.columns = [x.title() for x in list(salariestable.columns)]
# Convert all columns to strings (used for text later)
salariestable = salariestable.astype(str)
salariestable.index += 1 
# Render the table as Markdown with readable column headers
from IPython.display import Markdown
from tabulate import tabulate
Markdown(tabulate(
  salariestable, 
  headers = ["Player Name", "Salary", "Year", "Gender", "Country"]
))

Table 1: Salaries of the Top Nine Male and Female Football Players, 2022-2023
	Player Name	Salary	Year	Gender	Country
1	Cristiano Ronaldo	200000000	2022	Male	Portugal
2	Kylian Mbappé	110000000	2022	Male	France
3	Lionel Messi	65000000	2022	Male	Argentina
4	Neymar	55000000	2022	Male	Brazil
5	Mohamed Salah	35000000	2022	Male	Egypt
6	Erling Haaland	35000000	2022	Male	Norway
7	Robert Lewandowski	27000000	2022	Male	Poland
8	Eden Hazard	27000000	2022	Male	Belgium
9	Andres Iniesta	25000000	2022	Male	Spain
10	Sam Kerr	513000	2023	Female	Australia
11	Alex Morgan	450000	2023	Female	United States
12	Megan Rapinoe	447000	2023	Female	United States
13	Julie Ertz	430000	2023	Female	United States
14	Ada Hegerberg	425000	2023	Female	Norway
15	Marta Vieira	400000	2023	Female	Brazil
16	Amandine Henry	394000	2023	Female	France
17	Wendie Renard	392000	2023	Female	France
18	Christine Sinclair	380000	2023	Female	Canada

Code

# Load packages
import pandas as pd
import altair as alt
import numpy as np
import warnings
warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# Get the data
salaries = pd.read_excel("../data/salary_comparison.xlsx")
salaries.columns = [x.title() for x in list(salaries.columns)]
# Copy the subsets so the column assignments below modify independent frames
female_salaries = salaries[salaries["Gender"] == "Female"].copy()
male_salaries = salaries[salaries["Gender"] == "Male"].copy()

# Prepare the data
female_salaries.reset_index(drop = True, inplace = True)
male_salaries.Salary = male_salaries.Salary.astype(int)
female_salaries.Salary = female_salaries.Salary.astype(int)

# Make our custom color scheme
color_scheme = ["#009643", "#CB4349"]

# Add selection fields
selection = alt.selection_single(fields = ["Player"])

both = alt.Chart(salaries).mark_bar().encode(
    x = alt.X("Salary:Q",
            title = "Salary (USD)",
            sort = "descending",
            scale = alt.Scale(domain = (200500000, 0))),
    y = alt.Y("Player:N", axis = alt.Axis(title = "Player Name", orient = "left"), sort = alt.Sort(field = "Salary", order = "descending")),
    color = alt.Color("Salary:Q", scale = alt.Scale(range = color_scheme))
).properties(
    title = ["Pay Gap Between the Highest Paid Men's and Women's Football Players"], 
    width = 900, 
    height = 225
)

text1 = both.mark_text(
    align = "left",
    baseline = "middle",
    dx = 1
).encode(
    text = "Country:N"
)

both_final = (both + text1)

# Add caption for when salaries were collected
both_final = alt.concat(both_final).properties(title = alt.TitleParams(
        ["**Note: men's salaries are from December 2022 and women's salaries are from March 2023**", " "],
        baseline = "top",
        orient = "top",
        anchor = "end",
        fontWeight = "normal",
        fontSize = 11
    ))


# Men's chart
men_chart = alt.Chart(male_salaries).mark_bar().encode(
    x = alt.X("Salary:Q",
            title = "Salary (USD)",
            sort = "descending",
            scale = alt.Scale(domain = (0, 230000000))),
    y = alt.Y("Player:N", axis = alt.Axis(title = "Player"), sort = alt.Sort(field = "Salary", order = "descending")),
    color = alt.Color("Salary:Q", scale = alt.Scale(range = color_scheme))
).properties(
    title = ["Highest Paid Male Football Players"], 
    width = 390, 
    height = 225
)

textm = men_chart.mark_text(
    align = "right",
    baseline = "middle",
    dx = -3
).encode(
    text = "Country:N"
)

men_chart_final = (men_chart + textm)

women_chart = alt.Chart(female_salaries).mark_bar().encode(
    x = alt.X("Salary:Q",
            title = "Salary (USD)",
            sort = "descending",
            scale = alt.Scale(domain = (600000, 0))),
    y = alt.Y("Player:N", axis = alt.Axis(title = "Player Name", orient = "right"), sort = alt.Sort(field = "Salary", order = "descending")),
    color = alt.Color("Salary:Q", scale = alt.Scale(range = color_scheme))
).properties(
    title = ["Highest Paid Female Football Players (ZOOMED IN)"], 
    width = 390, 
    height = 225
)

textw = women_chart.mark_text(
    align = "left",
    baseline = "middle",
    dx = 3
).encode(
    text = "Country:N"
)

women_chart_final = (women_chart + textw)

# Add caption with data source
women_chart_final = alt.concat(women_chart_final).properties(title = alt.TitleParams(
        [" ", " ", "Data Source: Statista [9] and AS USA [10]"],
        baseline = "bottom",
        orient = "bottom",
        anchor = "end",
        fontWeight = "normal",
        fontSize = 11
    ))

bottom = alt.hconcat(men_chart_final , women_chart_final, spacing = 0)

final = alt.vconcat(both_final, bottom)
final.configure(background = "#F1F0DA").configure_title(fontSize = 15)

Figure 3: Comparison of Top Men’s and Women’s Player Salaries in the Last Year

From the two plots on top, it is clear that women’s player salaries are drastically lower than the men’s. When plotted on the same scale, no bars are even visible on the women’s side because their salaries are so low in comparison. To put it into perspective, based on the figures in Table 1, the highest paid female player, Sam Kerr, made nearly 49 times less this last year than the 9th highest paid male player, Andres Iniesta. While this pay gap has received a lot of attention in recent years, there is, unfortunately, much more that needs to be done to bridge this inequality. One other important note is how difficult it was to find recent data on female players’ salaries. It is crucial to give more attention to this disparity in order to facilitate meaningful change.
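As a quick check of the scale of the gap, the ratios can be computed directly from the salary figures listed in Table 1. This is only a small arithmetic sketch; the variable names are ours, and the salaries are the table values in USD.

```python
# Salaries in USD, copied from Table 1.
ronaldo_salary = 200_000_000  # highest paid male player
iniesta_salary = 25_000_000   # 9th highest paid male player
kerr_salary = 513_000         # highest paid female player

# How many times larger each male salary is than the top female salary.
ninth_male_vs_top_female = iniesta_salary / kerr_salary
top_male_vs_top_female = ronaldo_salary / kerr_salary

print(f"9th highest paid man vs. Sam Kerr: {ninth_male_vs_top_female:.1f}x")
print(f"Highest paid man vs. Sam Kerr: {top_male_vs_top_female:.1f}x")
```

Ratios like these also explain why the women’s bars vanish on a shared axis: on a chart 900 pixels wide scaled to the top male salary, the top female salary would occupy only a couple of pixels.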

Infrastructure: Exploring Match Locations for Women’s Tournaments

Regarding the many different tournaments leading to the FIFA World Cup, there is a particular trend in the location selection for said tournaments. There are three main tournaments preceding the FIFA World Cup, which is considered to be the main event for women’s football. These three tournaments are:

FA Women’s Super League
National Women’s Soccer League (NWSL)
UEFA Women’s Euro

First, we want to see which countries and specific stadiums are preferred for which competitions. This will act as the basis for the next step, which is identifying which countries have the largest number of highly ranked players.

For this visualization, we look at the period between 2018 and 2022. The FA Women’s Super League has the most data, with stadium locations for 2018, 2019, 2020, and 2021. Judging by the name alone, we can expect the National Women’s Soccer League to favor stadiums in the United States. In contrast, the FIFA World Cup location changes the most, as it depends on the host country. For example, in 2019, the FIFA World Cup was hosted in France, with its main stadium being the Parc Olympique Lyonnais, also called the Groupama Stadium, in Décines-Charpieu, France.

Code

#IMPORTING LIBRARIES
import plotly.graph_objects as go
import plotly.io as pio
import numpy as np
import pandas as pd
pio.renderers.default = "plotly_mimetype+notebook_connected"

# Importing data
df = pd.read_csv("../data/clean_matches.csv")

# CHANGE DATA TYPE OF YEAR TO STRING SO IT CAN BE COMPARED WITH THE STRING YEARS BELOW
df["year"] = df["year"].astype("str")

# Create subsets of data to create traces
df1_2018 = pd.DataFrame(df[(df['competition_name'] == 'FA Womens Super League') & (df['year'] == '2018')])
df1_2019 = pd.DataFrame(df[(df['competition_name'] == 'FA Womens Super League') & (df['year'] == '2019')])
df1_2020 = pd.DataFrame(df[(df['competition_name'] == 'FA Womens Super League') & (df['year'] == '2020')])
df1_2021 = pd.DataFrame(df[(df['competition_name'] == 'FA Womens Super League') & (df['year'] == '2021')])
df2 = pd.DataFrame(df[df['competition_name']=='NWSL'])
df3 = pd.DataFrame(df[df['competition_name']=='UEFA Womens Euro'])
df4 = pd.DataFrame(df[df['competition_name']=='Womens World Cup'])




# TRACE-1: 
trace1 = (  
    go.Scattergeo(  
        lat=df1_2018['lat'],
        lon=df1_2018['long'],
        text=df1_2018['text'],
        mode='markers',
        marker=dict(
            size=df1_2018['freq'],
            color='#FCC92B',
            opacity=0.8,
            symbol='circle'),
        name = "FA Women's Super League 2018", 
        visible=True))

# TRACE-2: 
trace2 = (  
    go.Scattergeo(  
        lat=df1_2019['lat'],
        lon=df1_2019['long'],
        text=df1_2019['text'],
        mode='markers',
        marker=dict(
            size=df1_2019['freq'],
            color='#009643',
            opacity=0.8,
            symbol='circle'),
        name = "FA Women's Super League 2019",
        visible=True))

# TRACE-3: 
trace3 = (  
    go.Scattergeo(  
        lat=df1_2020['lat'],
        lon=df1_2020['long'],
        text=df1_2020['text'],
        mode='markers',
        marker=dict(
            size=df1_2020['freq'],
            color='#009AFE',
            opacity=0.8,
            symbol='circle'),
        name = "FA Women's Super League 2020", 
        visible=True))


# TRACE-4: 
trace4 = (  
    go.Scattergeo(  
        lat=df1_2021['lat'],
        lon=df1_2021['long'],
        text=df1_2021['text'],
        mode='markers',
        marker=dict(
            size=df1_2021['freq'],
            color='#FF818D',
            opacity=0.8,
            symbol='circle'),
        name = "FA Women's Super League 2021",
        visible=True))

# TRACE-5: 
trace5 = (  
    go.Scattergeo(  
        lat=df2['lat'],
        lon=df2['long'],
        text=df2['text'],
        mode='markers',
        marker=dict(
            size=df2['freq'],
            color='#FF5108',
            opacity=0.8,
            symbol='circle'),
        name = "National Women's Soccer League 2018",
        visible=True))

# TRACE-6: 
trace6 = (  
    go.Scattergeo(  
        lat=df3['lat'],
        lon=df3['long'],
        text=df3['text'],
        mode='markers',
        marker=dict(
            size=df3['freq'],
            color='#8538B1',
            opacity=0.8,
            symbol='circle'),
        name = "UEFA Womens Euro 2022",
        visible=True))

# TRACE-7: 
trace7 = (  
    go.Scattergeo(  
        lat=df4['lat'],
        lon=df4['long'],
        text=df4['text'],
        mode='markers',
        marker=dict(
            size=df4['freq'],
            color='#960505',
            opacity=0.8,
            symbol='circle'), 
        name = 'Womens World Cup 2019',
        visible=True))

# COMBINING TRACES
traces = [trace1, trace2, trace3, trace4, trace5, trace6, trace7]

# INITIALIZE GRAPH OBJECT
fig = go.Figure(data=traces)
        
# VARIABLES FOR BUTTON LOCATION
button_height = 0.15
x1_loc = 0.00
y1_loc = 1

# ADD ANNOTATION
fig.add_annotation(
    x=0,
    y=0,
    text='Data Source: StatsBomb [3], Wikipedia [4]',
    showarrow=False,
    visible=True,
)


#DROPDOWN MENUS
fig.update_layout(
    title = 'Location of Stadiums by Tournament and Number of Matches Played', 
    plot_bgcolor='#F1F0DA',
    paper_bgcolor='#F1F0DA',
    #annotations=annotations,
    geo=dict(
        lonaxis=dict(
            range=[-130, 20],  
        ),
        lataxis=dict(
            range=[20,65], 
        ),
        showocean=True,
        oceancolor='#F1F0DA'
    ),
    updatemenus=[
        dict(
            buttons=[
                dict(
                    label="FA Women's Super League",           
                    method="update",                
                     args=[{"visible": [True, True, True, True, False, False, False]}
                    ]
                     ),
                dict(
                    label="National Women's Soccer League 2018",              
                    method="update",          
                     args=[{"visible": [False, False, False, False, True, False, False]}
                         ]
                     ),
                dict(
                    label="UEFA Womens Euro 2022",     
                    method="update",           
                     args=[{"visible": [False, False, False, False, False, True, False]}
                         ]
                     ),    
                dict(
                    label="Womens World Cup 2019",               
                    method="update",          
                     args=[{"visible": [False, False, False, False, False, False, True]}
                        ]
                     )           
            ],
            direction="down",
            showactive=True,  
            pad={"r": 10, "t": 10},  
            x=x1_loc, 
            y=y1_loc,
            xanchor="left",  
            yanchor="top"
        )
    ],
            width = 900
)

# SHOW FIGURE
fig.show()

Figure 4: Stadium Location for Different Women’s Tournaments

Overall, the stadium distribution for all these tournaments is concentrated in Europe. The one obvious exception, as mentioned earlier, is the National Women’s Soccer League, which has matches strictly in the United States. Most of its stadiums are located on the East Coast, which also hosts the highest number of matches. A few stadiums are on the West Coast, and very few are in the Midwest and South.

For the FA Women’s Super League, we can see that different stadiums were selected for various years; however, all of them are located in the United Kingdom, specifically England. The marker size represents the number of matches played at the stadium, but only relative to the other stadiums in the same tournament. If we were to compare the FA Women’s Super League with the National Women’s Soccer League, the marker sizes would not be comparable, since the highest number of matches played at any National Women’s Soccer League stadium is 5, while for the FA Women’s Super League it is over 30. Therefore, we added hover text to each trace, allowing the user to hover over any stadium and see how many matches were played there. For the FA Women’s Super League, we can see the four major football cities with major football clubs in England: London, Manchester, Liverpool, and Birmingham, where Manchester and Liverpool are close enough to overlap on the map. We can also see some stadiums west of London. We would expect these to be in the other major cities in England, such as Southampton, Bath, or Bournemouth. However, once we hover over these stadiums, we can see that many are in smaller towns that might not be as well known, which is also true for some of the UEFA Women’s Euro stadiums.
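To illustrate why marker sizes are only meaningful within a single tournament, the sketch below scales match counts to marker sizes relative to each tournament’s own maximum. This is an illustrative assumption about per-tournament scaling, not the sizing used in the figure above; the match counts, the size limits, and the helper name `scale_sizes` are hypothetical.

```python
def scale_sizes(match_counts, min_size=6, max_size=30):
    """Scale match counts to marker sizes relative to the busiest stadium."""
    top = max(match_counts)
    # The busiest stadium gets max_size; very small values are floored at
    # min_size so that every stadium stays visible on the map.
    return [max(min_size, max_size * count / top) for count in match_counts]


# Hypothetical matches per stadium for two tournaments.
nwsl_counts = [5, 3, 1]
wsl_counts = [32, 10, 4]

nwsl_sizes = scale_sizes(nwsl_counts)
wsl_sizes = scale_sizes(wsl_counts)

# The largest marker is the same size in both tournaments, even though it
# stands for 5 matches in one and 32 in the other.
print(nwsl_sizes[0], wsl_sizes[0])
```

With this kind of scaling, equally sized markers in two tournaments can represent very different match counts, which is why the hover text with the exact number is the reliable way to compare across tournaments.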

Similarly to the FA Women’s Super League, the stadium distribution for the UEFA Women’s Euro is highly concentrated in England. We again see stadiums in the London and Manchester/Liverpool areas; however, once we hover over the individual stadiums, they are usually located in a smaller city near one of the major cities.

Finally, for the FIFA Women’s World Cup, the host country changes with every edition, and in 2019, as mentioned before, France hosted the tournament. The stadiums are spread fairly evenly across France, and from the map we can identify stadiums around the major French cities, such as Paris in the north, Lyon in the central south, and Marseille on the southern coastline. West of Marseille, we can also see Montpellier, which has one of France’s bigger stadiums. Interestingly, although our research indicated that the Groupama Stadium in Décines-Charpieu was supposed to be the main stadium for this tournament, only three matches were played there. In comparison, seven matches were played at the Parc des Princes stadium in Paris. Hovering over the stadiums, we see once again that many of the locations are near major cities, but only a few are actually in the city.

To conclude, the stadium locations show a trend: many of the football stadiums are located close to large cities, but they are not the large stadiums that would likely host the equivalent men’s tournaments. There is a stigma regarding funding sports events for men and women. Women’s professional sports have been known to be underfunded, which may be one of the reasons for choosing smaller and less prestigious stadiums. However, this hypothesis requires further analysis.

The Rise of Women’s FIFA: Celebrating the Players and their Skill

Now that we have looked more generally into the FIFA World Cup and highlighted the persistent inequality between men’s and women’s football, we would like to focus solely on these women and their teams leading up to the 2023 Women’s World Cup. These visualizations will showcase the performance of the teams and players during past World Cup games, while specifically focusing on the key players set to participate in the upcoming tournament.

Geographical Dynamics: Key Players for the 2023 Women’s World Cup

As previously discussed, the Women’s World Cup is a large-scale and highly anticipated event this year. Many players are currently at the top of their game, and we would like to showcase them. In the summer of 2022, ESPN published a list of the top 50 players to watch out for in this next World Cup [2]. The table below reproduces the list from the article in the order that ESPN ranked the players. The choropleth globe below is based on this list and gives insight into which countries these key players will be representing this July.

Code

import warnings
warnings.simplefilter('ignore')
warnings.simplefilter(action='ignore', category=FutureWarning)

# Read in packages
import pandas as pd
import numpy as np
import plotly
import requests
import plotly.graph_objects as go
from plotly.offline import plot
import io
# Read in data
top50table = pd.read_csv("../data/top50_women_espn.csv")
# Convert all columns to strings (used for text later)
top50table = top50table.astype(str)
top50table.index += 1 
# Render the table as Markdown with readable column headers
from IPython.display import Markdown
from tabulate import tabulate
Markdown(tabulate(
  top50table, 
  headers=["Name", "Country", "Club", "Age", "Position", "Rank"]
))

Table 2: ESPN’s List of the Top 50 Upcoming Players
	Name	Country	Club	Age	Position	Rank
1	Alexia Putellas	Spain	Barcelona	28	Midfielder	22
2	Sam Kerr	Australia	Chelsea	28	Forward	2
3	Vivianne Miedema	Netherlands	Arsenal	25	Forward	3
4	Caroline Graham Hansen	Norway	Barcelona	27	Midfielder	9
5	Pernille Harder	Denmark	Chelsea	29	Forward	4
6	Catarina Macario	United States	Lyon	22	Midfielder	Not ranked
7	Marie-Antoinette Katoto	France	Paris Saint-Germain	23	Forward	19
8	Jennifer Hermoso	Spain	Pachuca	32	Forward	17
9	Aitana Bonmati	Spain	Barcelona	24	Midfielder	Not ranked
10	Ada Hegerberg	Norway	Lyon	26	Forward	Not ranked
11	Wendie Renard	France	Lyon	31	Defender	11
12	Christiane Endler	Chile	Lyon	30	Goalkeeper	30
13	Magdalena Eriksson	Sweden	Chelsea	28	Defender	34
14	Fran Kirby	England	Chelsea	29	Forward	12
15	Lieke Martens	Netherlands	Paris Saint-Germain	29	Forward	28
16	Lauren Hemp	England	Manchester City	21	Forward	Not ranked
17	Mapi Leon	Spain	Barcelona	27	Defender	Not ranked
18	Irene Paredes	Spain	Barcelona	30	Defender	Not ranked
19	Rose Lavelle	United States	OL Reign	27	Midfielder	15
20	Beth Mead	England	Arsenal	27	Forward	Not ranked
21	Debinha	Brazil	North Carolina Courage	30	Forward	10
22	Lindsey Horan	United States	Lyon	28	Midfielder	23
23	Stina Blackstenius	Sweden	Arsenal	26	Forward	Not ranked
24	Patri Guijarro	Spain	Barcelona	24	Midfielder	Not ranked
25	Ji So-Yun	South Korea	Suwon FC	31	Midfielder	18
26	Kadeisha Buchanan	Canada	Chelsea	26	Defender	33
27	Ellie Carpenter	Australia	Lyon	22	Defender	Not ranked
28	Ashley Lawrence	Canada	Paris Saint-Germain	27	Defender	Not ranked
29	Amandine Henry	France	Lyon	32	Midfielder	16
30	Kim Little	Scotland	Arsenal	31	Midfielder	39
31	Lucy Bronze	England	Barcelona	30	Defender	5
32	Fridolina Rolfo	Sweden	Barcelona	28	Defender	Not ranked
33	Trinity Rodman	United States	Washington Spirit	20	Forward	Not ranked
34	Jessie Fleming	Canada	Chelsea	24	Midfielder	Not ranked
35	Kadidiatou Diani	France	Paris Saint-Germain	27	Forward	36
36	Alex Morgan	United States	San Diego Wave	32	Forward	38
37	Sam Mewis	United States	Kansas City Current	29	Midfielder	1
38	Millie Bright	England	Chelsea	28	Defender	Not ranked
39	Sara Dabritz	Germany	Lyon	27	Midfielder	Not ranked
40	Barbara Bonansea	Italy	Juventus	31	Forward	Not ranked
41	Delphine Cascarino	France	Lyon	25	Midfielder	21
42	Caroline Weir	Scotland	Free agent	26	Midfielder	25
43	Asisat Oshoala	Nigeria	Barcelona	27	Forward	27
44	Jess Fishlock	Wales	OL Reign	35	Midfielder	Not ranked
45	Tabea Wassmuth	Germany	Wolfsburg	25	Forward	Not ranked
46	Lea Schuller	Germany	Bayern Munich	24	Forward	Not ranked
47	Leah Williamson	England	Arsenal	25	Defender	Not ranked
48	Caitlin Foord	Australia	Arsenal	27	Forward	Not ranked
49	Christine Sinclair	Canada	Portland Thorns	39	Forward	Not ranked
50	Jill Roord	Netherlands	Wolfsburg	25	Midfielder	Not ranked

Code

# Read in data
top50 = pd.read_csv("../data/top50_women_espn.csv")
# Convert all columns to strings (used for text later)
top50 = top50.astype(str)
top50.columns = ["name", "country", "club", "age", "position", "rank"]
def is_uk(country):
    if country in ["Scotland", "England", "Wales"]: return "United Kingdom"
    else: return country
top50["country2"] = top50["country"].apply(lambda x: is_uk(x))

# Get summary stats per country
top50_2 = top50.copy()
# Make age column an integer
top50_2["age"] = top50_2["age"].astype(int)
# Make rank column numeric and replace "Not ranked" with NaN
def replace_not_ranked(string):
    if string == "Not ranked": return np.nan
    else: return int(string)
top50_2["rank"] = top50_2["rank"].apply(lambda x: replace_not_ranked(x))
# Select both columns with a list; tuple-style selection fails in pandas >= 2.0
stats = top50_2.groupby(["country2"])[["age", "rank"]].mean().reset_index()
stats["age"] = stats["age"].round(1)
stats["rank"] = stats["rank"].round(1).fillna("Not Ranked")
stats.columns = ["country", "mean_age", "mean_rank"]

# Getting the number of players for each country
country_counts = top50["country2"].value_counts().reset_index()
country_counts.columns = ["country", "count"]
# Get country codes
codes = ["GB", "ES", "US", "FR", "CA", "AU", "NL", "SE", "DE", "NO", "DK", "CL", "BR", "KR", "IT", "NG"]
codes2 = ["GBR", "ESP", "USA", "FRA", "CAN", "AUS", "NLD", "SWE", "DEU", "NOR", "DNK", "CHL", "BRA", "KOR", "ITA", "NGA"]
# Merge with count data frame and stats data frame
merged_counts = pd.merge(country_counts, stats)
merged_counts["code"] = codes
merged_counts["code2"] = codes2
merged_counts = merged_counts.rename(columns = {"count": "counts"})

# Adding a text column with player information for each country
merged_counts["text"] = ""
# Loop through each country in the country count data
for i in range(len(merged_counts)):
    # Set the current country
    country = merged_counts.country[i]
    # Add summary stats for that country
    merged_counts.loc[i, "text"] = "SUMMARY STATS: " + country.upper()+ "<br>Number in the Top 50: " + str(merged_counts.counts[i]) + "<br>Mean Player Age: " + str(merged_counts.mean_age[i]) + "<br>Mean Player Ranking: " + str(merged_counts.mean_rank[i])

# Make a custom color map
colors = ["#27297F", "#404E7E", "#405C7E", "#406C7E", "#40787E", "#407E78", "#407E6D", "#407E5A", "#437E40"]

# Create the choropleth trace
choropleth_trace = go.Choropleth(
    locations = merged_counts["code2"],
    z = merged_counts["counts"],
    text = merged_counts["text"],
    colorscale = colors,
    autocolorscale = False,
    reversescale = True,
    marker_line_color = "black",
    marker_line_width = 0.5,
    colorbar_title = "Number of Players <br>in the Top 50",
    hovertemplate = "<b>%{text}</b><br>",
    hoverinfo = "name",
)
# Create the figure
globe = go.Figure(data = [choropleth_trace])
globe.update_layout(
    plot_bgcolor = "#F1F0DA",
    paper_bgcolor = "#F1F0DA",
    geo = dict(
        projection_type = "orthographic",
        showland = True,
        landcolor = "#9FC5AA",
        oceancolor = "rgb(152, 190, 217)",
        showcountries = True,
        showlakes = False,
        showocean = True,
        countrycolor = "rgb(30, 56, 38)",
        lakecolor = "rgb(135, 206, 250)",
        bgcolor = "#F1F0DA"
    ),
    title = "Origin Countries of the Top 50 Players to <br>Watch Out for in the 2023 FIFA Women's World Cup",
    title_x = 0.5,
    width = 900,
    height = 700, 
    annotations=[
        dict(
            x = 0.5,
            y = -0.05,
            showarrow = False,
            text = '<br>Data Source: "ESPN FC Women\'s Rank: The 50 best footballers in the world today" [2]',
            xref = "paper",
            yref = "paper",
            font = dict(
                size = 12,
                color = "black"
            )
        )
    ]
)
# Show the figure
globe.show()

Figure 5: Locations of the Key Players Participating in the 2023 World Cup

This globe serves as a choropleth plot showing where the key players to watch out for come from. We can see that many of these top-performing players come from the United Kingdom, representing the Scottish, Welsh, and English national teams. Spain and the United States are also home to several of these key players. By hovering over each country, we can see summary statistics for the key players who call that country home. For example, the six American players in ESPN’s list have an average age of around 26.3, while the six Spanish players have an average age of around 27.5. It will be exciting to see how these key players perform in the upcoming World Cup. We hope to gain insight into how they will perform from the following visualizations.
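The globe relies on each country being paired with the right ISO 3166-1 alpha-3 code. As a minimal sketch, assuming the country names used in the table above, the codes can be looked up by name instead of being assigned by row position, so that a change in row order cannot silently mislabel a country. The `ISO3` dictionary, the `add_iso3` helper, and the `code3` column are hypothetical names introduced for illustration, and the counts are toy values.

```python
import pandas as pd

# Hypothetical lookup table (an assumption for illustration):
# country name as used in the table -> ISO 3166-1 alpha-3 code
ISO3 = {
    "United Kingdom": "GBR", "Spain": "ESP", "United States": "USA",
    "France": "FRA", "Canada": "CAN", "Australia": "AUS",
    "Netherlands": "NLD", "Sweden": "SWE", "Germany": "DEU",
    "Norway": "NOR", "Denmark": "DNK", "Chile": "CHL",
    "Brazil": "BRA", "South Korea": "KOR", "Italy": "ITA", "Nigeria": "NGA",
}

def add_iso3(counts: pd.DataFrame, country_col: str = "country") -> pd.DataFrame:
    """Return a copy of `counts` with a `code3` column looked up by country name."""
    out = counts.copy()
    out["code3"] = out[country_col].map(ISO3)
    # Fail loudly if a country has no code rather than plotting a gap
    missing = out.loc[out["code3"].isna(), country_col].tolist()
    if missing:
        raise KeyError(f"No ISO-3 code for: {missing}")
    return out

# Toy input in arbitrary order; the result does not depend on row order
toy = pd.DataFrame({"country": ["Netherlands", "United Kingdom", "Spain"],
                    "count": [3, 9, 6]})
coded = add_iso3(toy)
print(coded)
```

The resulting `code3` column could then be passed as `locations` to the choropleth trace in place of a hand-ordered list.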

Winning Trends: Women’s World Cup Match Victories across Time

Now that we have a glimpse into which countries and players will play a key role in this upcoming World Cup, let us look back at the historical winning trends for nations playing in the FIFA Women’s World Cup. In this visualization, we want to see if there are trends in wins for different teams over time. We look at 30 different national teams and which of them made it into the top 10 at each FIFA Women’s World Cup between 1991 and 2019. This visualization is animated, allowing us to watch how the top 10 changes from one tournament to the next.

Code

import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from raceplotly.plots import barplot
import plotly.io as pio

pio.renderers.default = "plotly_mimetype+notebook_connected"
# read in the data
df_f = pd.read_csv("../data/kaggle_female_matches.csv")
# converting date
df_f['date']=pd.to_datetime(df_f['date'])
# adding year
df_f['year'] = df_f['date'].dt.year
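# keep FIFA World Cup matches up to 2019; copy so new columns can be added safely
df_f_2019 = df_f[(df_f['year'] < 2020) & (df_f['tournament'] == 'FIFA World Cup')].copy()
# label the winner of each match (or 'tie' for a draw)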
df_f_2019['winner'] = np.where(df_f_2019['home_score'] == df_f_2019['away_score'], 'tie',
                               np.where(df_f_2019['home_score'] > df_f_2019['away_score'], df_f_2019['home_team'],
                                        df_f_2019['away_team']))
df_f_2019['win_ct'] = 1
df_f_2019 = df_f_2019[df_f_2019['winner']!='tie']

# Create pivot table to get all possible combinations of year and winner (missing pairs become 0)
pivot_table = pd.pivot_table(df_f_2019, index=['year'], values='win_ct', aggfunc='sum', columns=['winner'], fill_value=0)

# Reshape the pivot table and reset the index to get the desired output format
output_df = pivot_table.stack().reset_index(name='win_ct')

# Create bar race animation
my_raceplot = barplot(output_df,  
                      item_column='winner', 
                      value_column='win_ct', 
                      time_column='year',
                      top_entries=10)

# Add labels and titles
my_raceplot.plot(item_label='Team',
                 value_label="Women's FIFA World Cup Wins",
                 frame_duration=2000,
                 title="Top 10 Women's FIFA World Cup Teams by Tournament Wins and Year",
                 time_label='Year:',)
import plotly.graph_objects as go

# Style the figure and add a data source note in the bottom left corner
my_raceplot.fig.update_layout(paper_bgcolor='#F1F0DA',plot_bgcolor = "#F1F0DA",height = 600,
annotations=[
        dict(
            x = -0.08,
            y = -0.45,
            showarrow = False,
            text = 'Data source: Kaggle [8]',
            xref = "paper",
            yref = "paper",
            font = dict(
                size = 12,
                color = "black"
            )
        )
    ]
)


Figure 6: Comparative Analysis of Women’s Team Victories Over Time

As expected, the results for 1991 and 2019 are incredibly different. This is partly because the number of teams playing in the FIFA Women’s World Cup in 1991 was significantly lower than in 2019. In 1991, the last team shown is North Korea with 0 wins; because so few teams played in the 1991 tournament, the top 10 is filled out with teams that recorded no wins. By 1995, there is already an increase in the number of teams playing, and the plot no longer shows countries with 0 wins. A specific group of countries stays in the top 10, such as the United States, Norway, Germany, Sweden, Brazil, and China PR. The positions of these countries on the chart change, with the United States and Norway staying in the top half and everything else shifting. England also starts to make an appearance, which fits with the earlier observation that most of the stadiums used for the tournaments played in the run-up to the FIFA Women’s World Cup are in England.

Starting with the 2000s, we see Germany moving up the chart and staying in the top 5 until 2019. For the rest of the 2000s, quite a few countries, such as Ghana or Japan, appear for only one tournament and then disappear. Finally, in 2007, we see England with only one win, but it moves up the chart and eventually ends up third in 2019. The final result for 2019 resembles the current state of women’s football: the United States, Netherlands, England, Sweden, and Germany are in the top 5, followed by France and Italy, and finally Australia, Brazil, and Norway, tied with two wins each.

This tool could be used to anticipate future growth: club managers could use it to identify national teams from which they want to recruit players. It could also support a national team’s strategic planning when deciding which players should start in which matches, or help betting agencies and people who want to bet on matches.
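The zero-win bars discussed above follow from how the data is reshaped before animation. The sketch below uses made-up match results (not the Kaggle data) and hypothetical team names to show, under those assumptions, that pivoting with `fill_value=0` and then stacking produces a row for every year and team pair, including teams that won nothing in a given year.

```python
import pandas as pd

# Toy match results (made-up values, not the Kaggle data)
matches = pd.DataFrame({
    "year":       [1991, 1991, 1995, 1995, 1995],
    "home_team":  ["A", "B", "A", "C", "C"],
    "away_team":  ["B", "A", "C", "B", "A"],
    "home_score": [2, 0, 1, 3, 1],
    "away_score": [1, 1, 1, 0, 0],
})

# Drop draws, then label the winner of every remaining match
decided = matches[matches["home_score"] != matches["away_score"]].copy()
home_won = decided["home_score"] > decided["away_score"]
decided["winner"] = decided["home_team"].where(home_won, decided["away_team"])

# Wins per year and team; fill_value=0 creates a cell for every year/team pair
wins = pd.pivot_table(decided.assign(win_ct=1), index="year", columns="winner",
                      values="win_ct", aggfunc="sum", fill_value=0)

# Back to long format: one row per (year, team), zeros included
long = wins.stack().reset_index(name="win_ct")
print(long)
```

If the zero rows are not wanted in the animation, one option would be to filter them out of the long table before plotting; whether that suits the race chart is a design choice.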

Player Attributes: Comparing Performance at the 2019 Women’s World Cup

We imagine that taking a closer look at the most recent Women’s World Cup in 2019 will help provide insight into what we can expect this July, with most of the key players in ESPN’s FC rank list set to participate. We took a subset of the data from the 2019 World Cup containing the women who made it onto ESPN’s list. A total of 13 players out of the 50 on the list scored during the 2019 World Cup. We will use the visualization below to compare their performance metrics when scoring their goals.

Code

# Load packages
import pandas as pd
import altair as alt
import numpy as np
from datetime import datetime, time

# Get the data
stats = pd.read_csv("../data/wwc_2019_match_shots.csv")
top50 = pd.read_csv("../data/top50_women_espn.csv")

# Make our custom color scheme
color_scheme = ['#0099FF', '#009643', '#CB4349', '#FF818C', '#FCC92B', '#FD5109', '#CE6DD3','#FA8F38', '#8538B1', '#4983F8', '#A9DDD6', '#A2F17D', '#0C0582', '#960505']

# Make a new minute timestamp column
stats["timestamp_minute"] = stats["minute"] + stats["second"]/60
stats["timestamp_minute"] = stats["timestamp_minute"].apply(lambda x: round(x, 2))
# Rename player column
stats = stats.rename(columns = {"player.name": "player_name"})
# Standardize Alex Morgan name
stats["player_name"] = stats["player_name"].replace("Alexandra Morgan Carrasco", "Alex Morgan")
# Get the top 50 player names
top50_names = list(top50.Name)
# Filter match data for only the players in top 50
stats_top50 = stats[stats["player_name"].isin(top50_names)].reset_index(drop = True)

# Add selection fields
selection = alt.selection_single(fields = ["player_name"], name = "Random")

## UPPER LEFT BAR CHART - # goals per player
bar1 = (alt.Chart(stats_top50)
 .mark_bar()
 .encode(y = "count()",
         x = alt.X("player_name:N",
         sort = alt.EncodingSortField(field = "timestamp_minute", op = "count", order = "ascending"), axis = alt.Axis(labelAngle = 45)),
         color = alt.condition(selection, alt.Color("player_name:N", scale = alt.Scale(range = color_scheme), title = "Player Name"), alt.value("lightgray"))
).add_selection(selection
).properties(
    title = {"text": "Number of 2019 World Cup Goals by Player"}, 
    width = 450, 
    height = 225
))
bar1.encoding.x.title = "Player Name"
bar1.encoding.y.title = "Number of World Cup Goals"

# LOWER LEFT BAR CHART - # mean timestamp of player's goals
bar2_intitial = (alt.Chart(stats_top50)
 .mark_bar()
 .encode(y = 'mean(timestamp_minute):Q',
         x = alt.X('player_name:N',
         sort = alt.EncodingSortField(field = "timestamp_minute", op = "mean", order = "ascending"), axis = alt.Axis(labelAngle = 45)),
         color = alt.condition(selection, alt.Color("player_name:N", scale = alt.Scale(range = color_scheme), title = "Player Name"), alt.value("lightgray"))
         #color = alt.condition(selection, alt.value("#009643"), alt.value("lightgray")))
).add_selection(selection
).properties(
    title = {"text": "Average Time These Players Scored"}, 
    width = 450, 
    height = 225
))
bar2_intitial.encoding.x.title = "Player Name"
bar2_intitial.encoding.y.title = "Mean Time of Goal (min)"


# HALFTIME LINE AND OVERTIME HORIZONTAL LINES 
halftime_text_h = alt.Chart(pd.DataFrame({"y": [47], "text": ["Halftime"]})).mark_text(color = "#505050",  align = "left").encode(x = alt.value(5), y = "y", text = "text")
overtime_text_h = alt.Chart(pd.DataFrame({"y": [92], "text": ["Overtime"]})).mark_text(color = "#505050",  align = "left").encode(x = alt.value(5), y = "y", text = "text")
halftime_h = alt.Chart(pd.DataFrame({"y": [45]})).mark_rule(color = "#505050").encode(y = "y")
overtime_h = alt.Chart(pd.DataFrame({"y": [90]})).mark_rule(color = "#505050").encode(y = "y")
 
bar2 = alt.layer(bar2_intitial, halftime_h, overtime_h, halftime_text_h, overtime_text_h)

# Add caption with data source
bar2 = alt.concat(bar2).properties(title = alt.TitleParams(
        [" ", " ", "Data Source: ESPN FC Women's Rank 2022 Article [2] and Statsbomb [3]"],
        baseline = "bottom",
        orient = "bottom",
        anchor = "end",
        fontWeight = "normal",
        fontSize = 11
    ))

# HALFTIME LINE AND OVERTIME VERTICAL LINES 
halftime_text_v = alt.Chart(pd.DataFrame({"x": [47], "text": ["Halftime"]})).mark_text(color = "#505050",  align = "left").encode(y = alt.value(5), x = "x", text = "text")
overtime_text_v = alt.Chart(pd.DataFrame({"x": [92], "text": ["Overtime"]})).mark_text(color = "#505050",  align = "left").encode(y = alt.value(5), x = "x", text = "text")
halftime_v = alt.Chart(pd.DataFrame({"x": [45]})).mark_rule(color = "#505050").encode(x = "x")
overtime_v = alt.Chart(pd.DataFrame({"x": [90]})).mark_rule(color = "#505050").encode(x = "x")
 

## SCATTER PLOT 1
scatter1_i = (alt.Chart(stats_top50)
 .mark_circle(size = 45)
 .encode(x = alt.X("timestamp_minute:Q"),
         y = "TimeInPoss:Q",
         color = alt.condition(selection, alt.Color("player_name:N", scale = alt.Scale(range = color_scheme), title = "Player Name"), alt.value("lightgray"))
  ).properties(
    title = {"text": "Possession Time Before Scoring by Player and Game Minute"}, 
    width = 450, 
    height = 167
))
scatter1_i.encoding.x.title = "Game Minute"
scatter1_i.encoding.y.title = "Time in Possession (ms)"

scatter1 = alt.layer(scatter1_i, halftime_v, overtime_v, halftime_text_v, overtime_text_v)


## SCATTER PLOT 2
scatter2_i = (alt.Chart(stats_top50)
 .mark_circle(size = 45)
 .encode(x = alt.X("timestamp_minute:Q"),
         y = "avevelocity:Q",
         color = alt.condition(selection, alt.Color("player_name:N", scale=alt.Scale(range = color_scheme)), alt.value("lightgray"))
).properties(
    title = {"text": "Average Ball Velocity When Scoring by Player and Game Minute"}, 
    width = 450, 
    height = 167
))
scatter2_i.encoding.x.title = "Game Minute"
scatter2_i.encoding.y.title = "Average Ball Velocity (m/s)"

scatter2 = alt.layer(scatter2_i, halftime_v, overtime_v, halftime_text_v, overtime_text_v)

## SCATTER PLOT 3
scatter3_i = (alt.Chart(stats_top50)
 .mark_circle(size = 45)
 .encode(x = alt.X("timestamp_minute:Q"),
         y = "DistToGoal:Q",
         color = alt.condition(selection, alt.Color("player_name:N", scale=alt.Scale(range = color_scheme)), alt.value("lightgray"))
).properties(
    title = {"text": "Distance to Goal When Scoring by Player and Game Minute"}, 
    width = 450, 
    height = 167
))
scatter3_i.encoding.x.title = "Game Minute"
scatter3_i.encoding.y.title = "Distance to Goal (m)"

scatter3 = alt.layer(scatter3_i, halftime_v, overtime_v, halftime_text_v, overtime_text_v)


# Combine the charts: bar charts stacked on the left, scatter plots stacked on the right

chart1 = alt.vconcat(bar1 , bar2)
chart2 = alt.vconcat(scatter1 , scatter2, scatter3)
alt.hconcat(chart1, chart2, spacing = 5).configure(background = "#F1F0DA").configure_title(fontSize = 15)

Figure 7: A Linked View of Key Player Performance Metrics at the 2019 World Cup

The first bar plot shows the total number of 2019 World Cup goals scored by each of the players. We can see that the top three players with the most World Cup goals in 2019 were Vivianne Miedema, Alex Morgan, and Caroline Graham Hansen. The second bar plot shows the average time these players scored their goals by game minute. We can see which players scored earlier in the game and which players scored later in the game on average. Jill Roord, for example, scored her goals towards the end of the games.

The three scatter plots on the right are linked to these bar plots. One can click on a player’s bar in the bar plots on the left side and that player’s information will be highlighted in the three scatter plots on the right. These scatter plots show each player’s time in possession before their goals, the ball’s average velocity when scoring their goals, and the player’s distance to goal when scoring, plotted across a standard game’s time period (including overtime). To reiterate, this data is based on the goals scored in the 2019 Women’s World Cup.
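The two bar charts summarize the same per-goal rows in two ways: a count per player and a mean game minute per player. As a rough sketch with made-up goal events and hypothetical player labels, the fractional game minute and both summaries can be computed directly in pandas, which is a convenient way to check the values the charts display.

```python
import pandas as pd

# Toy goal events (made-up values, hypothetical players)
goals = pd.DataFrame({
    "player_name": ["P1", "P1", "P2"],
    "minute": [10, 50, 88],
    "second": [30, 0, 45],
})

# Fractional game minute, e.g. 10 min 30 s -> 10.5
goals["timestamp_minute"] = (goals["minute"] + goals["second"] / 60).round(2)

# Goals per player and mean game minute of those goals, earliest scorers first
summary = (goals.groupby("player_name")["timestamp_minute"]
                .agg(goals="count", mean_minute="mean")
                .sort_values("mean_minute"))
print(summary)
```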

Field Stats: Shots & Outcomes at the 2019 Women’s World Cup Final

Below we see an innovative view of a modified scatter plot using matplotlib and plotly (not plotly express) where we creatively converted the coordinate plane into a football field. This view allows the audience to take a bird’s eye view into the actual shots taken during the last Women’s FIFA World Cup Final. Furthermore, the plot highlights how the offensive strategy taken by the US players contributed to their winning success compared to the shots taken by the Netherlands team.

Now that we have had the opportunity to view some players’ overall game statistics, especially Alex Morgan’s dominant presence, let’s look at some field statistics from the Women’s FIFA World Cup Final in 2019 between the United States and the Netherlands. The plot below shows a soccer field with the locations of shots taken by all players. The left side represents shots taken by players on the Netherlands team, while the right side represents shots taken by US players. The outcome of each shot is shown by color. There are four outcomes: “Blocked”, a shot that was stopped by a defender; “Goal”, a shot that was deemed by officials to have crossed the goal line; “Off T”, a shot whose initial trajectory ended outside the posts; and “Saved”, a shot that was saved by the opposing team’s keeper. By hovering over the shots on the field, you can see even more statistics, such as the player’s name, the time the ball was in their possession, the minute of the game, the body part used to take the shot, and the distance from the goal.
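Placing the two teams on opposite halves of one pitch requires mirroring one team's x-coordinates across the 120-unit pitch length. The sketch below illustrates that step on toy coordinates, assuming both teams' shots are recorded as if attacking toward x = 120. The `combine_shots` helper and the `side` column are hypothetical names introduced for illustration.

```python
import pandas as pd

def combine_shots(right_team: pd.DataFrame, left_team: pd.DataFrame,
                  pitch_length: float = 120.0) -> pd.DataFrame:
    """Stack two teams' shots, mirroring the second team onto the left half."""
    right = right_team.copy()
    right["side"] = "right"
    left = left_team.copy()  # copy so the caller's frame is not modified
    left["location.x"] = pitch_length - left["location.x"]
    left["side"] = "left"
    return pd.concat([right, left], ignore_index=True)

# Toy shot locations (made-up values)
us_toy = pd.DataFrame({"location.x": [100.0, 110.0], "location.y": [40.0, 35.0]})
ned_toy = pd.DataFrame({"location.x": [105.0], "location.y": [42.0]})

combined = combine_shots(us_toy, ned_toy)
print(combined)
```

Only the x-coordinate is flipped here, which matches a simple left/right reflection; a full 180-degree rotation of the pitch would also flip y.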

Code

import plotly.graph_objects as go
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
import pandas as pd
 
# read in the data
shots = pd.read_csv("../data/shots.csv")
net_shots = pd.read_csv("../data/net_shots.csv")

# subset the data
us_shots = shots[['location.x','location.y','shot.outcome.name','player.name','shot.body_part.name','TimeInPoss','DistToGoal','minute']]
net_shots = net_shots[['location.x','location.y','shot.outcome.name','player.name','shot.body_part.name','TimeInPoss','DistToGoal','minute']].copy()

# subtract the x-coordinates from the maximum x-coordinate value to get mirror image on field
net_shots['location.x'] = 120 - net_shots['location.x']

# join two dataframes
data_merge = pd.concat([us_shots,net_shots],axis=0) 

# round time in poss
data_merge['TimeInPoss'] = data_merge['TimeInPoss'].round()
data_merge['DistToGoal'] = data_merge['DistToGoal'].round()

#009643 (green) #FF818C (pink) #F1F0DA (cream) #0099FF (blue) #FCC92B (yellow) #FD5109 (orange) 

fig = go.Figure()

# One trace per shot outcome, each with its own marker color
outcome_colors = {'Goal': '#FF818C', 'Off T': '#0099FF', 'Blocked': '#FCC92B', 'Saved': '#FD5109'}

# Shared hover text for all traces
hover = ('<br><br>Player: %{customdata[0]}<br>Time in Possession (ms): %{customdata[1]}'
         '<br>Distance to goal (m): %{customdata[2]}<br>Body part: %{text}'
         '<br>Game minute: %{customdata[3]}')

for outcome, color in outcome_colors.items():
    # Shots with this outcome only
    subset = data_merge[data_merge['shot.outcome.name'] == outcome]
    fig.add_trace(go.Scatter(
        x=subset['location.x'],
        y=subset['location.y'],
        name=outcome,
        mode='markers',
        marker_color=color,
        customdata=subset[['player.name', 'TimeInPoss', 'DistToGoal', 'minute']],
        text=subset['shot.body_part.name'],
        hovertemplate=hover
    ))


# update traces
fig.update_traces(mode='markers', marker_size=12,marker_line_width=0.5)

# update layout
fig.update_layout(plot_bgcolor='#009643', 
                xaxis=dict(gridcolor='#009643',range=[0, 120],showticklabels=False), 
                height=600,
                width=800,
                paper_bgcolor='#F1F0DA',
                yaxis=dict(range=[0, 80], gridcolor='#009643',showticklabels=False),
                annotations=
                [dict(x=0.5,y=1.15,xref='paper',yref='paper',text='Shot Locations and Outcomes',showarrow=False,
                font=dict(family='Arial',size=18,color='black')),
                dict(x=0.5,y=1.08,xref='paper',yref='paper',text="Women's World Cup 2019 - USA vs Netherlands",showarrow=False,
                font=dict(family='Arial',size=14,color='black')),
                dict(x=0.1,y=-0.10,xref='paper',yref='paper',text='Netherlands Team',showarrow=False,
                font=dict(family='Arial',size=18,color='black')),
                dict(x=0.9,y=-0.10,xref='paper',yref='paper',text='United States Team',showarrow=False,
                font=dict(family='Arial',size=18,color='black')),
                dict(x=1.16,y=-0.18,xref='paper',yref='paper',text='Data source: StatsBomb [3]',showarrow=False,
                font=dict(family='Arial',size=10,color='black'))
                ])

# add shapes
fig.add_shape(type='line', x0=60, y0=0, x1=60, y1=80, line=dict(color='white', width=0.6))
fig.add_shape(type='rect',
              x0=0, y0=0, x1=120, y1=80,
              line=dict(color='white', width=0.5))
fig.add_shape(type='rect',
              x0=0, y0=0, x1=60, y1=80,
              line=dict(color='white', width=0.6))
fig.add_shape(type='rect',
              x0=0, y0=18, x1=18, y1=62,
              line=dict(color='white', width=0.6))
fig.add_shape(type='rect',
              x0=102, y0=18, x1=120, y1=62,
              line=dict(color='white', width=0.6))
fig.add_shape(type='rect',
              x0=0, y0=30, x1=6, y1=50,
              line=dict(color='white', width=0.6))
fig.add_shape(type='rect',
              x0=114, y0=30, x1=120, y1=50,
              line=dict(color='white', width=0.6))
fig.add_shape(type='rect',
              x0=0, y0=36, x1=0.5, y1=44,
              line=dict(color='white', width=0.6))
fig.add_shape(type='rect',
              x0=120, y0=36, x1=119.5, y1=44,
              line=dict(color='white', width=0.6))

# Center circle, centered on the midpoint of the pitch (60, 40)
fig.add_shape(type='circle',
              x0=49, y0=29, x1=71, y1=51,
              line=dict(color='white', width=0.6)
             )

fig.show()

Figure 8: Overview of Shots and Outcomes at the 2019 Women’s World Cup Final

Overall, the plot clearly shows that more shots were taken by US players than by Netherlands players. You can also see that no shots on the Netherlands side resulted in a goal, while two shots taken by US players did. This is in line with the final score of the game, 2-0 to the United States.

An interesting observation is that the majority of the Netherlands players who took a shot had very little time with the ball compared to US players. On average, Netherlands players who took a shot were in possession of the ball for only 9 ms, while US players were in possession for 23 ms. Most notably, Alex Morgan, who took a shot that ended up outside the posts, had the longest possession of the ball. Because Alex Morgan is a prominent US soccer player, visualizing her moves on the field is key. In this game, she shot with her left foot every time, and always in the second half. In fact, although none of her shots resulted in a goal, Alex Morgan accounted for over a quarter of all shots taken during the final, a very high share that is indicative of her impressive skills on the field.
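Comparisons like the ones above (mean possession time per team and one player's share of all shots) are simple aggregations over the shot table. The sketch below shows one way to compute them on toy data; the `team` and `player` columns and all values are assumptions for illustration and are not taken from the StatsBomb files.

```python
import pandas as pd

# Toy shot table (made-up values; `team` and `player` are hypothetical columns)
shots_toy = pd.DataFrame({
    "team": ["US", "US", "US", "NED"],
    "player": ["X", "X", "Y", "Z"],
    "TimeInPoss": [30.0, 10.0, 20.0, 8.0],
})

# Mean time in possession before a shot, per team
mean_poss = shots_toy.groupby("team")["TimeInPoss"].mean()

# Each player's share of all shots in the match
shot_share = shots_toy["player"].value_counts(normalize=True)

print(mean_poss)
print(shot_share)
```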

With this plot in mind, we are excited to see what the 2023 World Cup final has in store. Which team will ultimately take the crown as the 2023 Women’s FIFA champions?

Conclusion

Summary of Plots

We began this analysis by looking into which terms are most associated with the FIFA World Cup. The main terms, however, appear to be associated more with the men’s tournament. Therefore, we decided to explore the differences in funding between the men’s and women’s FIFA World Cups in relation to attendance. Although total attendance is higher for the men’s World Cup, the two tournaments are much closer in average and highest attendance. Despite this comparable attendance, a massive financial discrepancy exists between the men’s and women’s tournaments. Furthermore, we wanted to see whether this financial difference was reflected in the tournaments over the years. Thus, Figure 3 depicts individual teams’ development and potential for growth. Knowing the rankings over the years, we then looked into how the financial discrepancies affect the venues for different tournaments. Across the four main women’s tournaments, many of the stadiums that host women’s matches are located close to large cities. This is likely due to a lack of funds to afford venues, since we know from Figure 2 that it is not due to low attendance.

Looking back at our first figure, the plot showed Mbappé to be one of the terms most associated with FIFA; according to Forbes, he is the highest-paid football player in the world [1]. Therefore, we decided to look into the key female players heading into this next World Cup. Figure 5 shows an interactive globe for discovering where the key female players come from. Alex Morgan, who is ranked 38th and plays for the United States, has become very well known in recent years due to her talent. We can explore more of her skills in Figure 6 by seeing how many goals she scored in the 2019 World Cup, when she typically scores her goals during a game, and more.

Final Thoughts

Despite the Women’s World Cup being the world’s largest women’s sporting tournament, finding concise information and digestible visualizations is extremely challenging compared to men’s football. Therefore, this analysis hopes to catalyze an investigation into the world of women’s football and visually explore its essential aspects. All of the visualizations can be helpful tools for team managers, betting agencies, or football fans who want to explore previous trends and predict trends for the 2023 Women’s FIFA World Cup. With this analysis, we explored the strengths and weaknesses of different teams, players, and tournaments, as well as the overall financial discrepancy between the men’s and women’s FIFA tournaments. We hope that your main takeaway from this analysis is how underrated, underfunded, and under-analyzed women’s football is. In the next couple of years, we hope that coverage of and spending on women’s football will increase. Currently, it does seem as if FIFA is trying to increase coverage of the Women’s World Cup. In fact, in 2021, they finally published their first-ever analysis of the women’s football landscape [7]. With trends showing the growth that women’s football has made in the past 20+ years, one can hope that the discrepancy between men’s and women’s football will no longer exist in the future.

References

[1] Birnbaum, J. (2022, October 7). The World’s Highest-Paid Soccer Players 2022: Kylian Mbappé Claims No. 1 While Erling Haaland Debuts. Forbes. https://www.forbes.com/sites/justinbirnbaum/2022/10/07/the-worlds-highest-paid-soccer-players-2022-kylian-mbapp-claims-no-1-while-erling-haaland-debuts/?sh=360a1ee8629d

[2] ESPN. (2022, June 27). ESPN FC Women’s Rank: The 50 best footballers in the world today. ESPN. https://www.espn.com/soccer/blog-espn-fc-united/story/4685632/espn-fc-womens-rank-the-50-best-footballers-in-the-world-today

[3] StatsBomb. (2023, April 11). StatsBomb data: Event data. StatsBomb. Retrieved April 24, 2023, from https://statsbomb.com/what-we-do/hub/free-data/

[4] Wikipedia. (2023, April 16). Latitude and longitude of cities. Wikipedia. Retrieved April 24, 2023, from https://en.wikipedia.org/wiki/Geographic_coordinate_system

[5] Twitter. (2019). Get Twitter API. Twitter. Retrieved April 24, 2023, from https://api.twitter.com/2/tweets

[6] FIFA. (2022). Finances. Retrieved April 24, 2023, from https://www.fifa.com/about-fifa/organisation/finances

[7] FIFA. (2022, March 22). FIFA publishes first-ever comprehensive analysis of the elite women’s football landscape. FIFA.com. https://www.fifa.com/media-releases/fifa-publishes-first-ever-comprehensive-analysis-of-the-elite-women-s-football-l

[8] Jürisoo, M. (2022, August 1). Women’s international football results. Kaggle. Retrieved April 24, 2023, from https://www.kaggle.com/datasets/martj42/womens-international-football-results

[9] Statista Research Department. (2023, January 18). Highest paid footballers worldwide 2022. Statista. Retrieved May 2, 2023, from https://www.statista.com/statistics/266636/best-paid-soccer-players-in-the-2009-2010-season/#:~:text=As%20of%20December%202022%2C%20Cristiano,dollars%20in%20off%2Dfield%20income.

[10] Gorostieta, D. (2023, March 14). Who are the highest-paid women’s soccer players in the world? Diario AS. Retrieved May 2, 2023, from https://en.as.com/soccer/who-are-the-highest-paid-womens-soccer-players-in-the-world-n/

Appendix

Our Color Scheme

We crafted our color palette based on the 2023 FIFA World Cup colors seen at the top of this webpage. Below is the exact color palette used throughout our visualizations. We chose colors from this palette on a case-by-case basis, depending on whether a plot required a discrete or continuous color palette and on how many colors were needed overall.