More Regression

Data

Download the soccer World Cup data and set up your analysis environment.

library(tidyverse)
library(effects)

world_cup <- read_csv("data/world-cup.csv")

What questions can we answer with these data?

Logistic Regression

  • Logistic regression is a statistical model used for binary classification problems
  • It models an outcome (response, dependent) variable with two possible values (like yes/no, pass/fail, win/lose)
  • Unlike linear regression, which predicts continuous values, logistic regression estimates the probability that an observation belongs to a particular category
  • Logistic regression uses the sigmoid function to return a probability value between 0 and 1
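As a minimal sketch of the last bullet, the sigmoid (inverse logit) function can be written in one line. The function name `sigmoid` and the example values in `log_odds` are made up for illustration; base R's `plogis()` computes the same thing.

```r
# Sigmoid (inverse logit): maps any real number to a value in (0, 1)
sigmoid <- function(x) 1 / (1 + exp(-x))

# Hypothetical log-odds values, chosen only for illustration
log_odds <- c(-4, -1, 0, 1, 4)
probs <- sigmoid(log_odds)
print(round(probs, 3))

# Base R's plogis() is the same function
print(all.equal(probs, plogis(log_odds)))
```

A log-odds of 0 corresponds to a probability of 0.5; large negative values approach 0 and large positive values approach 1.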

Logistic Regression

To run logistic regression, we need to transform our response variable into a numeric variable:

world_cup <- world_cup |>
  mutate(response = if_else(win == "yes", 1, 0))

The 1/0 binary variable is handy for calculating win rates, because the mean of a 1/0 variable is the proportion of 1s:

world_cup |>
  group_by(team_name) |>
  summarize(percent_win = mean(response)) |>
  arrange(-percent_win)
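To see why the mean works here, a small base R sketch on made-up data may help. The data frame `toy` and its values are hypothetical and only mimic the `team_name` and `response` columns used above.

```r
# Hypothetical toy data: 1 = win, 0 = no win
toy <- data.frame(
  team_name = c("A", "A", "A", "A", "B", "B"),
  response  = c(1, 0, 1, 1, 0, 1)
)

# The mean of a 1/0 variable is the share of 1s in each group
win_rate <- tapply(toy$response, toy$team_name, mean)
print(win_rate)
```

Team A wins 3 of 4 matches and team B wins 1 of 2, so the group means are the win proportions (multiply by 100 for percentages).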

Logistic Regression

Once we have our 1/0 response variable, we fit a generalized linear model with glm() and family = binomial:

model <- glm(response ~ team_name,
            data = world_cup, family = binomial)
summary(model)

Call:
glm(formula = response ~ team_name, family = binomial, data = world_cup)

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)                       -0.91629    0.59161  -1.549  0.12143   
team_nameAngola                    1.60944    1.36015   1.183  0.23670   
team_nameArgentina                 1.60944    0.63683   2.527  0.01150 * 
team_nameAustralia                 0.10536    0.84327   0.125  0.90057   
team_nameAustria                   0.98528    0.69864   1.410  0.15845   
team_nameBelgium                   0.86977    0.66564   1.307  0.19132   
team_nameBolivia                  -0.69315    1.24499  -0.557  0.57770   
team_nameBosnia and Herzegovina  -15.64978 1385.37792  -0.011  0.99099   
team_nameBrazil                    1.73727    0.62740   2.769  0.00562 **
team_nameBulgaria                  0.10536    0.72839   0.145  0.88499   
team_nameCameroon                  0.47446    0.72975   0.650  0.51559   
team_nameCanada                  -15.64978 1385.37792  -0.011  0.99099   
team_nameChile                     0.79851    0.68415   1.167  0.24315   
team_nameChina PR                -15.64978 1385.37792  -0.011  0.99099   
team_nameColombia                  0.91629    0.74162   1.236  0.21663   
team_nameCosta Rica                0.55962    0.76997   0.727  0.46735   
team_nameCôte d'Ivoire           -15.64978  799.84846  -0.020  0.98439   
team_nameCroatia                   0.91629    0.77460   1.183  0.23684   
team_nameCuba                      0.22314    1.36015   0.164  0.86969   
team_nameCzech Republic            0.22314    1.36015   0.164  0.86969   
team_nameCzechoslovakia            0.91629    0.69522   1.318  0.18751   
team_nameDenmark                   1.16761    0.77715   1.502  0.13299   
team_nameDutch East Indies       -15.64978 2399.54479  -0.007  0.99480   
team_nameEcuador                   0.51083    0.87560   0.583  0.55962   
team_nameEgypt                     0.91629    1.16190   0.789  0.43034   
team_nameEl Salvador             -15.64978  979.61021  -0.016  0.98725   
team_nameEngland                   1.17580    0.64468   1.824  0.06817 . 
team_nameFrance                    1.21354    0.64578   1.879  0.06022 . 
team_nameGerman DR                 1.60944    1.04881   1.535  0.12490   
team_nameGermany                   1.90669    0.67490   2.825  0.00473 **
team_nameGermany FR                1.30833    0.64578   2.026  0.04277 * 
team_nameGhana                     0.91629    0.82664   1.108  0.26767   
team_nameGreece                    0.73397    0.84656   0.867  0.38594   
team_nameHaiti                   -15.64978 1385.37792  -0.011  0.99099   
team_nameHonduras                 -0.33647    0.99642  -0.338  0.73560   
team_nameHungary                   0.91629    0.68920   1.329  0.18368   
team_nameIR Iran                  -0.69315    1.24499  -0.557  0.57770   
team_nameIran                     -0.69315    1.24499  -0.557  0.57770   
team_nameIraq                    -15.64978 1385.37792  -0.011  0.99099   
team_nameIsrael                    1.60944    1.36015   1.183  0.23670   
team_nameItaly                     1.43355    0.63363   2.262  0.02367 * 
team_nameJamaica                   0.22314    1.36015   0.164  0.86969   
team_nameJapan                     0.04082    0.79582   0.051  0.95909   
team_nameKorea DPR                -0.87547    1.23153  -0.711  0.47716   
team_nameKorea Republic            0.31845    0.70065   0.455  0.64946   
team_nameKuwait                    0.22314    1.36015   0.164  0.86969   
team_nameMexico                    0.54160    0.65323   0.829  0.40704   
team_nameMorocco                   0.10536    0.84327   0.125  0.90057   
team_nameNetherlands               1.36828    0.65416   2.092  0.03647 * 
team_nameNew Zealand               0.22314    1.04881   0.213  0.83151   
team_nameNigeria                   0.14310    0.77045   0.186  0.85265   
team_nameNorthern Ireland          1.38629    0.82158   1.687  0.09154 . 
team_nameNorway                    1.42712    0.93986   1.518  0.12890   
team_nameParaguay                  0.54160    0.70951   0.763  0.44526   
team_namePeru                      0.51083    0.79232   0.645  0.51911   
team_namePoland                    1.24171    0.69461   1.788  0.07383 . 
team_namePortugal                  1.38629    0.71589   1.936  0.05281 . 
team_nameRepublic of Ireland     -15.64978  665.51423  -0.024  0.98124   
team_nameRomania                   1.01160    0.73547   1.375  0.16899   
team_nameRussia                    0.22314    0.92195   0.242  0.80875   
team_nameSaudi Arabia              0.10536    0.84327   0.125  0.90057   
team_nameScotland                 -0.12516    0.75861  -0.165  0.86895   
team_nameSenegal                   1.32176    1.08781   1.215  0.22434   
team_nameSerbia                    0.22314    1.36015   0.164  0.86969   
team_nameSerbia and Montenegro   -15.64978 1385.37792  -0.011  0.99099   
team_nameSlovakia                  0.91629    1.16190   0.789  0.43034   
team_nameSlovenia                 -0.69315    1.24499  -0.557  0.57770   
team_nameSouth Africa              0.22314    0.92195   0.242  0.80875   
team_nameSoviet Union              1.24171    0.69461   1.788  0.07383 . 
team_nameSpain                     1.22378    0.64762   1.890  0.05880 . 
team_nameSweden                    0.82928    0.66115   1.254  0.20973   
team_nameSwitzerland               0.55962    0.68661   0.815  0.41505   
team_nameTogo                    -15.64978 1385.37792  -0.011  0.99099   
team_nameTrinidad and Tobago     -15.64978 1385.37792  -0.011  0.99099   
team_nameTunisia                  -0.18232    0.89132  -0.205  0.83792   
team_nameTurkey                    1.32176    0.87560   1.510  0.13116   
team_nameUkraine                   1.32176    1.08781   1.215  0.22434   
team_nameUnited Arab Emirates    -15.64978 1385.37792  -0.011  0.99099   
team_nameUruguay                   0.83933    0.65348   1.284  0.19900   
team_nameUSA                       0.31015    0.69194   0.448  0.65398   
team_nameWales                     2.30259    1.26491   1.820  0.06871 . 
team_nameYugoslavia                1.18822    0.67832   1.752  0.07982 . 
team_nameZaire                   -15.64978 1385.37792  -0.011  0.99099   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2362.1  on 1703  degrees of freedom
Residual deviance: 2158.3  on 1621  degrees of freedom
AIC: 2324.3

Number of Fisher Scoring iterations: 15
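The estimates in the summary output above are on the log-odds scale, and each team coefficient is a difference from the reference team (the factor level that does not appear in the coefficient table). A rough sketch of how to turn them into probabilities, using two estimates copied from the output and base R's `plogis()`:

```r
# Estimates copied from the summary output above (log-odds scale)
intercept <- -0.91629    # reference team
brazil    <-  1.73727    # team_nameBrazil: difference from the reference
canada    <- -15.64978   # team_nameCanada: one of the rows with a huge std. error

p_reference <- plogis(intercept)           # win probability, reference team
p_brazil    <- plogis(intercept + brazil)  # win probability, Brazil
p_canada    <- plogis(intercept + canada)  # essentially zero
odds_ratio  <- exp(brazil)                 # Brazil's odds relative to the reference

print(round(c(reference = p_reference, brazil = p_brazil,
              canada = p_canada, odds_ratio = odds_ratio), 3))
```

The reference team's estimated win probability is about 0.29 and Brazil's is about 0.69. Estimates near -15.6 with very large standard errors are what one typically sees when a team has no wins in the data: the fitted probability is pushed toward 0, and the Wald z-test for that row is not informative.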

Logistic Regression

summary() does not report an \(R^2\) for a glm, so we calculate a deviance-based pseudo-\(R^2\) manually:

1 - model$deviance/model$null.deviance
[1] 0.08625239
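For a 1/0 response, the quantity 1 - deviance / null deviance should coincide with McFadden's pseudo-\(R^2\), which compares the log-likelihood of the fitted model with that of an intercept-only model. The sketch below checks this on simulated data; the data frame `toy`, the predictor `x`, and the seed are all made up for illustration.

```r
# Simulated toy data (hypothetical; not the World Cup data)
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, 1, plogis(0.5 * toy$x))

fit  <- glm(y ~ x, data = toy, family = binomial)  # model of interest
null <- glm(y ~ 1, data = toy, family = binomial)  # intercept-only model

# Deviance-based version, as computed above
r2_dev <- 1 - fit$deviance / fit$null.deviance

# McFadden's version, from the log-likelihoods
r2_mcf <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

print(c(deviance_based = r2_dev, mcfadden = r2_mcf))
print(all.equal(r2_dev, r2_mcf))
```

Pseudo-\(R^2\) values are usually much smaller than the \(R^2\) of a linear regression and are not a proportion of variance explained, so they are best used to compare models fitted to the same data.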

We can also run anova on the model:

anova(model)
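For a glm, anova() produces an analysis of deviance table. Depending on the R version, the table may not include a p-value unless test = "Chisq" is passed. The same likelihood-ratio test can be sketched by hand from the deviances and degrees of freedom printed in the summary output:

```r
# Values copied from the summary output above
null_deviance  <- 2362.1; null_df  <- 1703
resid_deviance <- 2158.3; resid_df <- 1621

# Drop in deviance when team_name is added, and the degrees of freedom used
lr_stat <- null_deviance - resid_deviance
lr_df   <- null_df - resid_df

# Compare the drop in deviance to a chi-squared distribution
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

The deviance drops by about 203.8 on 82 degrees of freedom, which gives a very small p-value: taken as a whole, team_name improves on the intercept-only model.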

Logistic Regression

We can also get the predicted values for each team with effect(), sorted from highest to lowest:

effect("team_name", model) |>
  data.frame() |>
  arrange(-fit)
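With a single categorical predictor, the fitted probability for each team should simply equal that team's observed win proportion. The base R sketch below cross-checks this with `predict()` on made-up data; the data frame `toy` is hypothetical, and `predict()` is used here instead of effect() only to keep the example free of extra packages.

```r
# Hypothetical toy data with the same column names as above
toy <- data.frame(
  team_name = rep(c("A", "B", "C"), each = 4),
  response  = c(1, 1, 1, 0,   1, 0, 0, 0,   1, 1, 0, 0)
)

fit <- glm(response ~ team_name, data = toy, family = binomial)

# Predicted win probability for each team, on the response (0 to 1) scale
teams <- data.frame(team_name = c("A", "B", "C"))
teams$fit <- as.numeric(predict(fit, newdata = teams, type = "response"))

# Observed win proportion for each team, for comparison
teams$observed <- as.numeric(
  tapply(toy$response, toy$team_name, mean)[teams$team_name]
)

# Sort from highest to lowest predicted probability
print(teams[order(-teams$fit), ])
```

The `fit` and `observed` columns agree up to numerical tolerance. What the model adds over the raw proportions is the uncertainty around each estimate, such as the standard errors and confidence limits.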