- Altair project with World Cup data
- Equivalent R project
November 29 2022
We will be using the FIFA World Cup dataset from Kaggle.
Download starter project
Altair works with Jupyter notebooks (you need both the altair and jupyter packages installed to run the starter project).
You have a notebook with the data loaded. Let’s start by inspecting the data:
You will need to drop duplicates from the data.
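As a minimal sketch of this cleaning step, the snippet below counts and drops fully repeated rows with pandas. The tiny DataFrame is a made-up stand-in for the notebook's `world_cup_data` (the real data comes from the Kaggle file), so its values are purely illustrative.

```python
import pandas as pd

# Toy stand-in for the notebook's data: made-up rows, one of them repeated.
world_cup_data = pd.DataFrame({
    "Year": [2001, 2001, 2002],
    "Stage": ["Final", "Final", "Final"],
    "Attendance": [100, 100, 200],
})

print(world_cup_data.shape)               # rows and columns before cleaning
print(world_cup_data.duplicated().sum())  # number of fully repeated rows

# drop_duplicates() returns a new DataFrame, so assign the result back.
world_cup_data = world_cup_data.drop_duplicates().reset_index(drop=True)
print(world_cup_data.shape)               # rows and columns after cleaning
```

Resetting the index is optional; it only keeps the row labels contiguous after rows are removed.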
Our first plot will be a scatterplot that answers the following question: Has attendance changed over the years?
alt.Chart(world_cup_data).mark_point().encode( x = ..., y = ... )
alt.Chart(world_cup_data).mark_point().encode( alt.X("Year:Q", scale=alt.Scale(zero=False), axis=alt.Axis(format="d")), y = "Attendance:Q", )
Our next question is the following: Which countries have won the FIFA World Cup, and how many times?
To answer this question, we first need to create a new column called winner that takes the name of the team with the most goals in the match.
add_one = lambda x: x + 1
add_one(5)
We first need to write a function that takes a data row as a parameter and returns row['Home Team Name'] if row['Home Team Goals'] > row['Away Team Goals'] and row['Away Team Name'] if row['Away Team Goals'] > row['Home Team Goals'], else (it’s a tie) you can return
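One possible shape for such a function is sketched below. The name `get_winner` matches the call used later in this post, but the `tie_label` parameter and its default value `"Tie"` are assumptions made for illustration, since any placeholder would work for tied matches. The example row is a plain dictionary with made-up values; a pandas row can be indexed the same way.

```python
def get_winner(row, tie_label="Tie"):
    """Return the name of the team that scored more goals, or tie_label."""
    if row["Home Team Goals"] > row["Away Team Goals"]:
        return row["Home Team Name"]
    if row["Away Team Goals"] > row["Home Team Goals"]:
        return row["Away Team Name"]
    # Assumption: "Tie" is a hypothetical placeholder for drawn matches.
    return tie_label


# Made-up example row, only to show how the function is called.
example = {
    "Home Team Name": "Team A",
    "Home Team Goals": 2,
    "Away Team Name": "Team B",
    "Away Team Goals": 1,
}
print(get_winner(example))
```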
We now need to call the function we created inside the apply() method (from pandas), saving the returned values as a column in our dataframe.
world_cup_data["winner"] = world_cup_data.apply(lambda row: get_winner(row), axis=1)
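To see the row-wise apply end to end, here is a small self-contained sketch. The DataFrame holds made-up matches, and the `"Tie"` return value in `get_winner` is an assumption for illustration. With `axis=1`, pandas passes each row to the function as a Series, so the row can be indexed by column name.

```python
import pandas as pd


def get_winner(row):
    # "Tie" is a hypothetical placeholder for drawn matches.
    if row["Home Team Goals"] > row["Away Team Goals"]:
        return row["Home Team Name"]
    if row["Away Team Goals"] > row["Home Team Goals"]:
        return row["Away Team Name"]
    return "Tie"


# Made-up matches standing in for the real dataset.
world_cup_data = pd.DataFrame({
    "Home Team Name": ["Team A", "Team C", "Team E"],
    "Home Team Goals": [2, 0, 1],
    "Away Team Name": ["Team B", "Team D", "Team F"],
    "Away Team Goals": [1, 3, 1],
})

# axis=1 applies the function to each row; the result is stored as a new column.
world_cup_data["winner"] = world_cup_data.apply(get_winner, axis=1)
print(world_cup_data[["Home Team Name", "Away Team Name", "winner"]])
```

Passing `get_winner` directly behaves the same as wrapping it in `lambda row: get_winner(row)`; the lambda form is only needed when extra arguments or logic are involved.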
We are interested in the final matches only, so we can filter our data by
world_cup_data[world_cup_data.Stage == "Final"]
We can select the column we created, winner, and then count how many times each value shows up in the column using value_counts()
world_cup_data[world_cup_data.Stage == "Final"]["winner"].value_counts()
We will need to get the country names from the index and add them as a column in our data to be able to plot them.
winner_count["country"] = winner_count.index
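The filtering, counting, and "index to column" steps can also be chained, as in the sketch below. This is one possible approach rather than the notebook's exact code: the toy DataFrame is made up, and the column name `count` is an assumption chosen for illustration. `rename_axis` names the index `country`, and `reset_index` turns it into a regular column so that a plotting library can refer to it by name.

```python
import pandas as pd

# Made-up data standing in for the dataset with its new "winner" column.
world_cup_data = pd.DataFrame({
    "Stage": ["Final", "Final", "Final", "Group 1"],
    "winner": ["Team A", "Team B", "Team A", "Team C"],
})

# Keep only the final matches.
finals = world_cup_data[world_cup_data["Stage"] == "Final"]

# Count wins per team, then move the team names from the index into a column.
winner_count = (
    finals["winner"]
    .value_counts()
    .rename_axis("country")
    .reset_index(name="count")  # "count" is an assumed column name
)
print(winner_count)
```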
alt.Chart(winner_count).mark_bar().encode( y = alt.Y(...), x = alt.X(...) )