- Altair project with World Cup data
- Equivalent R project
November 29 2022
We will be using the FIFA World Cup dataset from Kaggle.
Altair works with Jupyter notebooks (you need both the altair and jupyter packages installed to run the starter project).
You have a notebook with the data loaded. Let’s start by inspecting the data:
world_cup_data.head()
world_cup_data.shape
You will need to drop duplicates from the data.
# drop_duplicates() returns a new DataFrame, so assign the result back
world_cup_data = world_cup_data.drop_duplicates()
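To see why the result of drop_duplicates() has to be kept, here is a minimal sketch on a made-up three-row table. The table name `matches` and its rows are assumptions for illustration only; they are not taken from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical three-row table in which the last row repeats the first.
matches = pd.DataFrame(
    {
        "Year": [1930, 1930, 1930],
        "Home Team Name": ["France", "USA", "France"],
        "Away Team Name": ["Mexico", "Belgium", "Mexico"],
    }
)

# drop_duplicates() returns a new DataFrame and leaves the original untouched.
deduplicated = matches.drop_duplicates()

print(matches.shape)       # the original still has all three rows
print(deduplicated.shape)  # the new frame has the repeated row removed
```

Comparing the two shapes is a quick way to check how many duplicate rows were removed.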
Our first plot will be a scatterplot that answers the following question: Has attendance changed over the years?
alt.Chart(world_cup_data).mark_point().encode(
    x = ...,
    y = ...
)
alt.Chart(world_cup_data).mark_point().encode(
    alt.X("Year:Q", scale=alt.Scale(zero=False), axis=alt.Axis(format="d")),
    y="Attendance:Q",
)
Our next question is the following: Which countries have won the FIFA World Cup, and how many times?
To answer this question, we first need to create a new column called winner that takes the name of the team with the most goals in the match.
Similar to what we’ve done in JavaScript, we can create anonymous functions in Python:
add_one = lambda x: x + 1
add_one(5)
We first need to write a function that takes a data row as a parameter and returns row['Home Team Name'] if row['Home Team Goals'] > row['Away Team Goals'], and row['Away Team Name'] if row['Away Team Goals'] > row['Home Team Goals']. Otherwise (it’s a tie), it can return row['Win conditions'].
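One possible way to write this function is sketched below. The name get_winner matches the call used later in this walkthrough; the two sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def get_winner(row):
    """Return the winning team's name, or the 'Win conditions' text for a tie."""
    if row["Home Team Goals"] > row["Away Team Goals"]:
        return row["Home Team Name"]
    if row["Away Team Goals"] > row["Home Team Goals"]:
        return row["Away Team Name"]
    # Equal scores: fall back to the 'Win conditions' column.
    return row["Win conditions"]


# Hypothetical rows, invented for this example only.
home_win = pd.Series({
    "Home Team Name": "Team A", "Away Team Name": "Team B",
    "Home Team Goals": 3, "Away Team Goals": 1,
    "Win conditions": " ",
})
tie = pd.Series({
    "Home Team Name": "Team A", "Away Team Name": "Team B",
    "Home Team Goals": 1, "Away Team Goals": 1,
    "Win conditions": "Team B win on penalties",
})

print(get_winner(home_win))
print(get_winner(tie))
```

Note that for tied matches the returned value is the full 'Win conditions' text rather than a bare team name, so those entries may need further cleaning before counting.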
We now need to call the function we created with the pandas apply() method, saving the returned values as a new column in our dataframe.
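# axis=1 applies the function to each row; assign the result to a new column
world_cup_data["winner"] = world_cup_data.apply(lambda row: get_winner(row), axis=1)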
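As a small illustration of how apply() behaves with axis=1, the sketch below uses a made-up table and a hypothetical helper, total_goals, that is not part of the project. Each row is passed to the function as a pandas Series, and the returned values come back as a Series that can be stored as a column.

```python
import pandas as pd

# Hypothetical scores, invented for this example only.
scores = pd.DataFrame(
    {"Home Team Goals": [4, 1, 2], "Away Team Goals": [1, 1, 3]}
)


def total_goals(row):
    """Hypothetical helper: add up the goals scored in one match."""
    return row["Home Team Goals"] + row["Away Team Goals"]


# axis=1 calls total_goals once per row; the result is aligned with the
# DataFrame's index, so it can be assigned directly as a new column.
scores["total_goals"] = scores.apply(total_goals, axis=1)

print(scores)
```

Passing the function directly, as here, is equivalent to wrapping it in a lambda that forwards the row.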
We are interested in the final matches only, so we can filter our data by Stage:
world_cup_data[world_cup_data.Stage == "Final"]
We can select the column we created, winner, and then count how many times each value appears in it using value_counts():
world_cup_data[world_cup_data.Stage == "Final"]["winner"].value_counts()
The country names are stored in the index of the counts, so we need to add them to the data as a column before we can plot them.
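# store the counts as a DataFrame so that a column can be added to it
winner_count = world_cup_data[world_cup_data.Stage == "Final"]["winner"].value_counts().to_frame()
winner_count["country"] = winner_count.index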
alt.Chart(winner_count).mark_bar().encode(
    y = alt.Y(...),
    x = alt.X(...)
)
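The filtering and counting steps above can be sketched end to end on a toy table. The rows and the column name wins are assumptions made for this example; the sketch builds the plotting table with explicit column names, which is one alternative to copying the index into a column.

```python
import pandas as pd

# Hypothetical matches, invented for this example only.
matches = pd.DataFrame(
    {
        "Stage": ["Final", "Semi-finals", "Final", "Final"],
        "winner": ["Team A", "Team A", "Team B", "Team A"],
    }
)

# Keep only the finals, then count how often each winner appears.
finals = matches[matches.Stage == "Final"]
counts = finals["winner"].value_counts()  # Series: index = names, values = counts

# Build a tidy DataFrame with one named column per variable to plot.
winner_count = pd.DataFrame(
    {"country": counts.index, "wins": counts.values}
)

print(winner_count)
```

Naming both columns explicitly avoids depending on how a particular pandas version labels the output of value_counts().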