Case Studies

Penguins Data

Download the penguin data and set up your analysis environment:

library(tidyverse)
penguins <- read_csv("data/penguins.csv")

Two categorical variables – contingency table

my_penguin_table <- table(penguins$species, penguins$island)
addmargins(my_penguin_table)
           
            Biscoe Dream Torgersen Sum
  Adelie        44    56        52 152
  Chinstrap      0    68         0  68
  Gentoo       124     0         0 124
  Sum          168   124        52 344
  1. What is the probability of a randomly selected penguin being Adelie if we know the penguin is on Biscoe island?
  2. What is the probability of a randomly selected penguin being on Torgersen island if we know the penguin is Adelie?

Conditional probabilities

           
            Biscoe Dream Torgersen Sum
  Adelie        44    56        52 152
  Chinstrap      0    68         0  68
  Gentoo       124     0         0 124
  Sum          168   124        52 344
  1. What is the probability of a randomly selected penguin being Adelie if we know the penguin is on Biscoe island? P(Adelie|Biscoe) = P(Adelie \(\cap\) Biscoe) / P(Biscoe) = (44/344) / (168/344) = 44/168 ≈ 0.262
  2. What is the probability of a randomly selected penguin being on Torgersen island if we know the penguin is Adelie? P(Torgersen|Adelie) = P(Adelie \(\cap\) Torgersen) / P(Adelie) = (52/344) / (152/344) = 52/152 ≈ 0.342
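
Both answers can also be read off with base R's prop.table(). The following is a minimal sketch, assuming the counts printed in the table above; it rebuilds them by hand in a matrix called penguin_counts (a name introduced here for illustration, not part of the original analysis) so that it runs without data/penguins.csv.

```r
# Species-by-island counts copied from the contingency table above.
# penguin_counts is a hypothetical name used only in this sketch.
penguin_counts <- matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# margin = 2 divides each column by its total: P(species | island)
p_species_given_island <- prop.table(penguin_counts, margin = 2)

# margin = 1 divides each row by its total: P(island | species)
p_island_given_species <- prop.table(penguin_counts, margin = 1)

p_species_given_island["Adelie", "Biscoe"]     # same as 44/168
p_island_given_species["Adelie", "Torgersen"]  # same as 52/152
```

With the real data, the same two calls should work directly on my_penguin_table, since prop.table() accepts the output of table().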

Palm tree data

Download the palm tree data:

palmtrees <- read_csv("data/palmtrees.csv")
my_palmtree_table <- table(palmtrees$climbing, palmtrees$stem_solitary)
addmargins(my_palmtree_table)
              
               both non-solitary solitary  Sum
  both            2            5        9   16
  climbing      272         1108      428 1808
  non-climbing   37           47      274  358
  Sum           311         1160      711 2182
  1. What is the probability that a palm tree is only solitary?
  2. What is the probability that a palm tree is only solitary if we know the tree is a climbing tree?
  3. What is the probability of a palm tree being only solitary and only climbing?

Probabilities

              
               both non-solitary solitary  Sum
  both            2            5        9   16
  climbing      272         1108      428 1808
  non-climbing   37           47      274  358
  Sum           311         1160      711 2182
  1. What is the probability that a palm tree is only solitary? P(solitary) = 711/2182 ≈ 0.326
  2. What is the probability that a palm tree is only solitary if we know the tree is a climbing tree? P(solitary|climbing) = P(solitary \(\cap\) climbing) / P(climbing) = 428/1808 ≈ 0.237
  3. What is the probability of a palm tree being only solitary and only climbing? P(solitary \(\cap\) climbing) = 428/2182 ≈ 0.196
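
The three answers are a marginal, a conditional, and a joint probability, and they differ only in which total goes in the denominator. The sketch below makes that explicit, assuming the counts printed in the table above; palm_counts and the p_* names are introduced here for illustration only.

```r
# Climbing-by-stem counts copied from the table above.
# palm_counts is a hypothetical name used only in this sketch.
palm_counts <- matrix(
  c(  2,    5,   9,
    272, 1108, 428,
     37,   47, 274),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    climbing      = c("both", "climbing", "non-climbing"),
    stem_solitary = c("both", "non-solitary", "solitary")
  )
)
n_total <- sum(palm_counts)

# 1. Marginal: column total over grand total
p_solitary <- sum(palm_counts[, "solitary"]) / n_total

# 2. Conditional: cell count over the total of the conditioning row
p_solitary_given_climbing <-
  palm_counts["climbing", "solitary"] / sum(palm_counts["climbing", ])

# 3. Joint: cell count over grand total
p_solitary_and_climbing <- palm_counts["climbing", "solitary"] / n_total

# Product rule check: P(A and B) = P(A | B) * P(B)
p_climbing <- sum(palm_counts["climbing", ]) / n_total
all.equal(p_solitary_and_climbing, p_solitary_given_climbing * p_climbing)
```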

Email data

Download the email data:

email <- read_csv("data/email.csv")
my_email_table <- table(email$spam, email$winner)
addmargins(my_email_table)
     
        no  yes  Sum
  0   3510   44 3554
  1    347   20  367
  Sum 3857   64 3921
  • What is the probability of an email being spam if we know the word winner is not in the message?
  • What is the probability of an email being spam if we know the word winner is in the message?

Conditional probabilities

     
        no  yes  Sum
  0   3510   44 3554
  1    347   20  367
  Sum 3857   64 3921
  • What is the probability of an email being spam if we know the word winner is not in the message? P(spam|no winner) = 347/3857 = 0.0899663
  • What is the probability of an email being spam if we know the word winner is in the message? P(spam|winner) = 20/64 = 0.3125
  • P(spam) = 367/3921 = 0.09359857
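
The overall spam rate sits between the two conditional rates because it is their weighted average, with weights given by how often "winner" is absent or present. A minimal sketch of that relationship, assuming the counts printed in the table above (email_counts and the p_* names are introduced here for illustration):

```r
# Spam-by-winner counts copied from the table above (rows: spam 0/1).
# email_counts is a hypothetical name used only in this sketch.
email_counts <- matrix(
  c(3510, 44,
     347, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(spam = c("0", "1"), winner = c("no", "yes"))
)

# Column-wise proportions; the "1" row is P(spam | winner status)
p_spam_given_winner <- prop.table(email_counts, margin = 2)["1", ]
p_spam_given_winner

# Marginal distribution of the winner variable
p_winner <- colSums(email_counts) / sum(email_counts)

# Law of total probability:
# P(spam) = P(spam | no) * P(no) + P(spam | yes) * P(yes)
p_spam <- sum(p_spam_given_winner * p_winner)
p_spam
```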

Logistic regression

email_model <- glm(spam ~ winner, data = email, family = binomial)
summary(email_model)

Call:
glm(formula = spam ~ winner, family = binomial, data = email)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.31405    0.05627 -41.121  < 2e-16 ***
winneryes    1.52559    0.27549   5.538 3.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2437.2  on 3920  degrees of freedom
Residual deviance: 2412.7  on 3919  degrees of freedom
AIC: 2416.7

Number of Fisher Scoring iterations: 5

library(effects)
effect("winner", email_model)

 winner effect
winner
       no       yes 
0.0899663 0.3125000 
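
The effect output matches the conditional probabilities from the table because the model has a single categorical predictor, so its fitted probabilities are the inverse logit of the coefficients. The sketch below checks this by hand, assuming the rounded estimates printed in the summary above; b0, b1, and the p_* names are introduced here for illustration.

```r
# Coefficients copied (rounded) from the summary() output above
b0 <- -2.31405  # (Intercept): log-odds of spam when winner == "no"
b1 <-  1.52559  # winneryes: change in log-odds when winner == "yes"

# plogis() is the inverse logit: plogis(x) = 1 / (1 + exp(-x))
p_spam_no_winner <- plogis(b0)       # close to 347/3857
p_spam_winner    <- plogis(b0 + b1)  # close to 20/64

# exp(b1) is the odds ratio of spam, "yes" versus "no"
odds_ratio <- exp(b1)

c(no = p_spam_no_winner, yes = p_spam_winner)
odds_ratio
```

Because the coefficients are rounded to five decimals, the results agree with the table-based values only up to rounding error.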

Multiclass logistic regression

library(nnet)
model_penguins <- multinom(species ~ island, data = penguins)
# weights:  12 (6 variable)
initial  value 377.922627 
iter  10 value 184.355764
iter  20 value 181.998211
final  value 181.975950 
converged

effect("island", model_penguins)

island effect (probability) for Adelie
island
   Biscoe     Dream Torgersen 
0.2618842 0.4515092 0.9999974 

island effect (probability) for Chinstrap
island
      Biscoe        Dream    Torgersen 
2.542366e-06 5.484908e-01 8.273478e-08 

island effect (probability) for Gentoo
island
      Biscoe        Dream    Torgersen 
7.381133e-01 3.638939e-10 2.494737e-06
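
As in the binary case, the fitted probabilities should be close to the observed column proportions of the species-by-island table. Cells with a count of zero appear as very small positive numbers rather than exact zeros, most likely because a multinomial logit model can only approach a probability of 0 or 1 and the optimizer stops once it has converged. The sketch below compares the two for Adelie, assuming the counts and the effect values printed earlier; penguin_counts and fitted_adelie are names introduced here for illustration.

```r
# Species-by-island counts copied from the penguin contingency table.
# penguin_counts is a hypothetical name used only in this sketch.
penguin_counts <- matrix(
  c( 44, 56, 52,
      0, 68,  0,
    124,  0,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Observed P(species | island): each column divided by its total
observed <- prop.table(penguin_counts, margin = 2)

# Fitted Adelie probabilities copied from the effect() output above
fitted_adelie <- c(Biscoe = 0.2618842, Dream = 0.4515092, Torgersen = 0.9999974)

# Largest absolute gap between observed and fitted values
max(abs(observed["Adelie", ] - fitted_adelie))

round(observed, 3)
```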