Full Case Study – Emails

Setting up a new .qmd file

For this case study, we will do what I expect you to do for the final project.

We will start by creating a new .qmd file.

Click File in the top menu, choose New File, then click Quarto Document...
Fill out the title and author, then click Create.

Download the email data and save it as data/email.csv inside your project folder, so that the path used in the code below works.

In your .qmd file, add a code chunk for the following (shortcut: Option + Command + I on Mac, Ctrl + Alt + I on Windows)

library(tidyverse)
library(effects)

email <- read_csv("data/email.csv")

You can check the data dictionary to understand what the variables are.

What factors have an effect on whether an email message is spam or not?

What percentage of the data is spam?

email |>
  summarize(mean(spam))

mean(spam)
0.0935986
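
Because spam is coded as 0/1, its mean is the proportion of spam messages, and multiplying by 100 turns it into a percentage. Here is a minimal sketch on a small made-up data frame. The names toy_email and pct_spam are illustrative only and are not part of the course data.

```r
library(tidyverse)

# Toy data, made up for illustration: 1 = spam, 0 = not spam
toy_email <- tibble(spam = c(0, 0, 1, 0, 1, 0, 0, 0, 0, 0))

# The mean of a 0/1 column is the proportion of 1s;
# multiplying by 100 expresses it as a percentage
toy_summary <- toy_email |>
  summarize(pct_spam = 100 * mean(spam))

toy_summary
```

Read the same way, the value 0.0935986 above means that roughly 9.4% of the emails in the data are spam.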

What percentage of emails sent to multiple recipients are spam?

email |>
  group_by(to_multiple) |>
  summarize(mean(spam))

to_multiple	mean(spam)
0	0.1075432
1	0.0193548
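
When comparing group rates, it can help to report how many messages fall in each group, since a rate based on few rows is less reliable. The sketch below names the summary columns and adds a count. It uses made-up toy data, and the column names n and spam_rate are my own choices.

```r
library(tidyverse)

# Toy data, made up for illustration
toy_email <- tibble(
  to_multiple = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1),
  spam        = c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0)
)

# For each value of to_multiple, count the rows and compute the spam proportion
toy_rates <- toy_email |>
  group_by(to_multiple) |>
  summarize(
    n = n(),
    spam_rate = mean(spam)
  )

toy_rates
```

The same summarize() call should work on the email data if you replace toy_email with email.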

What other variables can we investigate?

What is the distribution of number of characters of spam vs. non-spam messages?

email |>
  ggplot(aes(x = factor(spam), y = num_char)) +
  geom_boxplot()

What other plots can we create?
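
One possible variation, offered as a sketch rather than the expected answer: if num_char is strongly right-skewed, a log scale on the y axis can make the two boxes easier to compare. The data below are simulated for illustration only. Note that a log scale cannot display values of 0, so check for zeros before applying it to real data.

```r
library(tidyverse)

# Simulated toy data (not the course data): right-skewed positive values
set.seed(1)
toy_email <- tibble(
  spam = rep(c(0, 1), each = 50),
  num_char = c(rlnorm(50, meanlog = 2, sdlog = 1),
               rlnorm(50, meanlog = 1, sdlog = 1))
)

# Same boxplot as before, with a log10 y axis and explicit axis labels
p <- toy_email |>
  ggplot(aes(x = factor(spam), y = num_char)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(x = "spam (0 = no, 1 = yes)", y = "num_char (log scale)")

p
```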

What factors have an effect on whether an email message is spam or not?

model <- glm(spam ~ to_multiple + attach + num_char + number,
             data = email,
             family = binomial)
anova(model, test = "Chisq")

	Df	Deviance	Resid. Df	Resid. Dev	Pr(>Chi)
NULL	NA	NA	3920	2437.180	NA
to_multiple	1	65.16041	3919	2372.019	0.0000000
attach	1	8.82110	3918	2363.198	0.0029777
num_char	1	98.71806	3917	2264.480	0.0000000
number	2	123.19374	3915	2141.286	0.0000000

Variance explained (proportion of deviance explained)

1 - model$deviance/model$null.deviance

[1] 0.1214081
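
The same number can be checked by hand from the anova table: the null deviance is the Resid. Dev in the NULL row, and the model deviance is the Resid. Dev in the last row. The variable names in this sketch are my own.

```r
# Deviances copied from the anova table above
null_deviance     <- 2437.180  # Resid. Dev of the NULL row (no predictors)
residual_deviance <- 2141.286  # Resid. Dev after adding all four predictors

# Proportion of the null deviance that the model accounts for
deviance_explained <- 1 - residual_deviance / null_deviance
round(deviance_explained, 4)
```

So the four predictors together account for about 12% of the null deviance.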

Find a good model.
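
One common way to compare candidate models, sketched here on simulated data rather than the email data, is to fit a smaller and a larger model and compare them with AIC (lower is better) and a likelihood-ratio test. The variables x1, x2, and y are hypothetical. This is only one of several reasonable approaches.

```r
# Simulated toy data: y depends on x2 but not on x1
set.seed(42)
n <- 500
toy <- data.frame(
  x1 = rbinom(n, 1, 0.3),
  x2 = rnorm(n)
)
toy$y <- rbinom(n, 1, plogis(-1 + 1.5 * toy$x2))

# Two nested logistic regression models
small <- glm(y ~ x2, data = toy, family = binomial)
large <- glm(y ~ x1 + x2, data = toy, family = binomial)

# AIC penalizes extra parameters; the model with the lower AIC is preferred
comparison <- AIC(small, large)
comparison

# Likelihood-ratio test: does adding x1 reduce the deviance significantly?
anova(small, large, test = "Chisq")
```

With the email data, the same pattern would apply to two glm() fits that differ by one or more predictors.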

You can check the effects of each variable.

effect("number", model) |>
  data.frame() |>
  arrange(fit)

	number	fit	se	lower	upper
3	small	0.0481429	0.0042325	0.0404939	0.0571507
1	big	0.0925483	0.0130101	0.0700100	0.1213951
2	none	0.1854859	0.0190710	0.1509691	0.2257957