November 29 2022

Agenda

  • Altair project with World Cup data
  • Equivalent R project

Case Study: World Cup Data

Data and Starter Project

Altair

Altair works with Jupyter notebooks (you need both the altair and jupyter packages installed to run the starter project).

You have a notebook with the data loaded. Let’s start by inspecting the data:

world_cup_data.head()
world_cup_data.shape

You will need to drop duplicate rows from the data. drop_duplicates() returns a new dataframe, so assign the result back:

world_cup_data = world_cup_data.drop_duplicates()
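
To see how drop_duplicates() behaves, here is a small sketch on a made-up table (the toy dataframe and its values are assumptions, not the real World Cup data). The method returns a new dataframe and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Toy data with one repeated row (values made up for illustration).
toy = pd.DataFrame({
    "Year": [1930, 1930, 1934],
    "Attendance": [4444, 4444, 9000],
})

deduped = toy.drop_duplicates()

rows_before = toy.shape[0]     # the original still holds every row
rows_after = deduped.shape[0]  # the returned copy has the repeat removed
print(rows_before, rows_after)
```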

Altair Scatterplot

Our first plot will be a scatterplot that answers the following question: Has attendance changed over the years?

alt.Chart(world_cup_data).mark_point().encode(
  x = ...,
  y = ...
)

Changing scale and axis

alt.Chart(world_cup_data).mark_point().encode(
    alt.X("Year:Q", 
           scale=alt.Scale(zero=False),
           axis=alt.Axis(format="d")),
    y = "Attendance:Q",
)

Anonymous Functions

Our next question is the following: Which countries have won the FIFA World Cup, and how many times?

To answer this question we need to first create a new column called winner that takes the name of the team with the most goals in the match.

Similar to what we’ve done in JavaScript, we can create anonymous functions in Python:

add_one = lambda x : x + 1
add_one(5)
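
Anonymous functions are most useful when passed directly to another function. A short sketch, using a made-up list of (home, away) scores that is not part of the starter project:

```python
# Made-up (home goals, away goals) pairs for illustration.
scores = [(4, 1), (2, 2), (0, 3)]

# Total goals per match, computed with an anonymous function.
total_goals = list(map(lambda s: s[0] + s[1], scores))

# Sort matches by goal difference, largest margin first.
by_margin = sorted(scores, key=lambda s: abs(s[0] - s[1]), reverse=True)

print(total_goals)
print(by_margin)
```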

Function that returns value

We first need to write a function that takes a data row as a parameter and returns row['Home Team Name'] if row['Home Team Goals'] > row['Away Team Goals'], or row['Away Team Name'] if row['Away Team Goals'] > row['Home Team Goals']. Otherwise (it’s a tie), it can return row['Win conditions'].
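
One possible way to write this function is sketched below. The name get_winner matches the call used in the next step, and the column names are the ones quoted above. The demo rows are plain dictionaries with made-up values, which works here because the function only looks values up by key.

```python
def get_winner(row):
    """Return the winning team's name, or the win conditions for a tie."""
    if row["Home Team Goals"] > row["Away Team Goals"]:
        return row["Home Team Name"]
    if row["Away Team Goals"] > row["Home Team Goals"]:
        return row["Away Team Name"]
    # Equal goals: fall back to the text describing how the match was decided.
    return row["Win conditions"]


# Made-up rows for a quick check (not taken from the real data).
home_win = {"Home Team Name": "A", "Away Team Name": "B",
            "Home Team Goals": 2, "Away Team Goals": 1,
            "Win conditions": ""}
away_win = {"Home Team Name": "A", "Away Team Name": "B",
            "Home Team Goals": 0, "Away Team Goals": 3,
            "Win conditions": ""}
tie = {"Home Team Name": "A", "Away Team Name": "B",
       "Home Team Goals": 1, "Away Team Goals": 1,
       "Win conditions": "B win on penalties"}

print(get_winner(home_win), get_winner(away_win), get_winner(tie))
```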

Apply lambda

We now need to call the function we created through the apply() method (from pandas), saving the returned values as a new column in our dataframe. Passing axis=1 applies the function to each row.

world_cup_data["winner"] = world_cup_data.apply(lambda row: get_winner(row), axis=1)
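
To see apply() with axis=1 in isolation, here is a sketch on a two-row made-up table. The get_winner defined here is a simplified stand-in that ignores ties, and all values are invented for illustration.

```python
import pandas as pd

# Made-up matches (not the real World Cup data).
matches = pd.DataFrame({
    "Home Team Name": ["A", "C"],
    "Away Team Name": ["B", "D"],
    "Home Team Goals": [2, 0],
    "Away Team Goals": [1, 3],
})


def get_winner(row):
    # Simplified stand-in: assumes there are no ties.
    if row["Home Team Goals"] > row["Away Team Goals"]:
        return row["Home Team Name"]
    return row["Away Team Name"]


# axis=1 passes each row (as a Series) to the function, one at a time.
matches["winner"] = matches.apply(lambda row: get_winner(row), axis=1)
winners = matches["winner"].tolist()
print(winners)
```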

Filter Data

We are interested in the final matches only. So we can filter our data by Stage:

world_cup_data[world_cup_data.Stage == "Final"]
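
The filter works in two steps: the comparison builds a Boolean Series, and indexing with that Series keeps only the rows marked True. A sketch on made-up data (the stage labels and years are assumptions for illustration):

```python
import pandas as pd

# Made-up rows (not the real World Cup data).
matches = pd.DataFrame({
    "Stage": ["Group 1", "Final", "Semi-finals", "Final"],
    "Year": [1930, 1930, 1934, 1934],
})

is_final = matches.Stage == "Final"  # Boolean Series, one value per row
finals = matches[is_final]           # keeps only the rows where it is True

final_years = finals["Year"].tolist()
print(is_final.tolist(), final_years)
```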

Filter and Aggregate

We can select the column we created, winner, and then count how many times each value shows up in the column using value_counts().

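winner_count = world_cup_data[world_cup_data.Stage == "Final"]["winner"].value_counts().to_frame("count")

value_counts() uses the country names as the index of its result, and to_frame() turns that result into a dataframe. We need to copy the index into a column of its own to be able to plot it.

winner_count["country"] = winner_count.index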
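
The counting step can be tried on its own with a made-up list of winners. This is a sketch under assumptions: the names below are invented input, and the column name "count" is a choice made here, not something fixed by the starter project.

```python
import pandas as pd

# Made-up winners of six finals (for illustration only).
winners = pd.Series(
    ["Brazil", "Italy", "Brazil", "Uruguay", "Brazil", "Italy"],
    name="winner",
)

# value_counts() returns a Series indexed by the distinct values,
# sorted from most to least frequent; to_frame() makes it a dataframe.
winner_count = winners.value_counts().to_frame("count")

# Copy the index (the country names) into a regular column.
winner_count["country"] = winner_count.index

print(winner_count)
```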

Creating a bar plot

alt.Chart(winner_count).mark_bar().encode(
    y = alt.Y(...),
    x = alt.X(...)
)