Review

Data

Download the data on movies and their ratings, and set up your analysis environment.

library(tidyverse)
movies <- read_csv("data/movies.csv")

• What variables are there in the data?
• What are their types?
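
One way to answer both questions at once is to print a compact overview of the data frame. The sketch below uses a small made-up tibble named toy_movies: its rows are invented for illustration, and only the column names averageRating, type, and year are taken from the code in this review. With the real data you would call glimpse(movies) instead.

```r
library(tibble)  # part of the tidyverse

# Made-up rows for illustration only; not taken from data/movies.csv
toy_movies <- tibble(
  type          = c("movie", "short", "tvEpisode"),
  year          = c(1999, 2005, 2012),
  averageRating = c(6.1, 6.8, 7.4)
)

# One line per variable: name, type, and the first few values
glimpse(toy_movies)

# The same type information as a named character vector
col_types <- sapply(toy_movies, class)
col_types
```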

Dependent variable: Rating

Step 1: Histogram

movies |>
  ggplot(aes(x = averageRating)) +
  # Set the bin width explicitly instead of relying on the default of 30 bins
  geom_histogram(binwidth = 0.5)

What can we say about the distribution of the rating variable?
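
A few summary numbers can back up what the histogram shows. The sketch below uses a made-up vector toy_ratings (an assumption for illustration, not values from the data); with the real data, replace it with movies$averageRating. When the mean lies below the median, the distribution has a longer tail toward low ratings.

```r
# Made-up ratings for illustration only
toy_ratings <- c(2.1, 5.5, 6.4, 6.9, 7.2, 7.5, 7.8, 8.0, 8.3, 9.1)

rating_mean   <- mean(toy_ratings)
rating_median <- median(toy_ratings)

# Quartiles describe where the middle half of the ratings lies
rating_quartiles <- quantile(toy_ratings, probs = c(0.25, 0.5, 0.75))

# TRUE suggests a longer tail toward low ratings (left skew)
left_skewed <- rating_mean < rating_median
```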

Independent variable: type

Boxplot

movies |>
  ggplot(aes(x = averageRating, y = type)) +
  geom_boxplot()

Descriptive Statistics

movies |>
  group_by(type) |>
  summarize(mean = mean(averageRating),
            min = min(averageRating),
            max = max(averageRating),
            n = n())

Questions

What questions can we answer?

Rating vs. Type

What is the effect of type on rating?

rating_by_type <- lm(averageRating ~ type, data = movies)
summary(rating_by_type)

Call:
lm(formula = averageRating ~ type, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3704 -0.6462  0.1296  0.7687  3.8538 

Coefficients:
                 Estimate Std. Error  t value Pr(>|t|)    
(Intercept)      6.146239   0.002288 2685.794  < 2e-16 ***
typeshort        0.663780   0.003954  167.889  < 2e-16 ***
typetvEpisode    1.224139   0.002915  419.962  < 2e-16 ***
typetvMiniSeries 0.952108   0.012474   76.328  < 2e-16 ***
typetvMovie      0.417639   0.006182   67.558  < 2e-16 ***
typetvSeries     0.685069   0.005765  118.826  < 2e-16 ***
typetvShort      0.653040   0.026420   24.718  < 2e-16 ***
typetvSpecial    0.766563   0.013785   55.608  < 2e-16 ***
typevideo        0.361448   0.006195   58.348  < 2e-16 ***
typevideoGame    0.616535   0.105955    5.819 5.93e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.24 on 1081851 degrees of freedom
Multiple R-squared:  0.1455,    Adjusted R-squared:  0.1455 
F-statistic: 2.047e+04 on 9 and 1081851 DF,  p-value: < 2.2e-16
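
Because type is categorical, the intercept is the estimated mean rating of the reference type (the one level that has no row of its own in the coefficient table), and every other coefficient is the difference from that reference. The arithmetic below uses a few of the estimates printed above to recover group means; the object names are chosen here for illustration.

```r
# Estimates copied from the summary output above
intercept <- 6.146239
differences <- c(
  short     = 0.663780,
  tvEpisode = 1.224139,
  tvMovie   = 0.417639,
  video     = 0.361448
)

# Estimated mean rating per type = reference mean + difference
estimated_means <- intercept + differences
round(estimated_means, 2)
```

For example, TV episodes come out at about 6.15 + 1.22 = 7.37, the largest estimate in the table. If the model and the descriptive statistics use the same rows, these values should agree with the group means computed earlier, up to rounding.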

Rating vs. Type

What is the effect of type on rating?

library(effects)

effect("type", rating_by_type) |>
  data.frame() |>
  ggplot(aes(x = fit, xmin = lower, xmax = upper,
             y = reorder(type, fit), label = round(fit, 2))) +
  geom_errorbar() +
  geom_label()

Independent variable: year

Scatter plot

movies |>
  ggplot(aes(y = averageRating, x = year)) +
  # Over a million points overlap heavily; transparency keeps dense regions readable
  geom_point(alpha = 0.05)

Rating vs. Year

What is the effect of year on rating?

rating_by_year <- lm(averageRating ~ year, data = movies)
summary(rating_by_year)

Call:
lm(formula = averageRating ~ year, data = movies)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0645 -0.7201  0.1972  0.9077  3.6386 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.121e+01  1.157e-01  -96.84   <2e-16 ***
year         9.024e-03  5.781e-05  156.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.326 on 1081859 degrees of freedom
Multiple R-squared:  0.02202,   Adjusted R-squared:  0.02202 
F-statistic: 2.436e+04 on 1 and 1081859 DF,  p-value: < 2.2e-16
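
For a numeric predictor, the slope is the expected change in rating per one-year increase, and a prediction is the intercept plus the slope times the year. The sketch below plugs in the rounded estimates printed above, so the results are approximate; predict_rating is a helper name introduced here for illustration.

```r
# Estimates copied from the summary output above (rounded to four digits)
intercept <- -11.21
slope     <- 0.009024

# Predicted average rating for a given year
predict_rating <- function(year) intercept + slope * year

predicted_2000    <- predict_rating(2000)  # about 6.84
change_per_decade <- slope * 10            # about 0.09 rating points
```

The intercept itself is the prediction for year 0, far outside the data, so it has no useful interpretation on its own. The R-squared of about 0.022 also shows that year accounts for only around 2% of the variation in ratings, even though the slope is highly significant.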

Rating vs. Year

What is the effect of year on rating?

library(effects)

effect("year", rating_by_year) |>
  data.frame() |>
  ggplot(aes(y = fit, ymin = lower, ymax = upper,
             x = year, label = round(fit, 2))) +
  geom_errorbar() +
  geom_label()
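
Because year is numeric, the fitted values can also be drawn as a line with a confidence band rather than as separate error bars. The sketch below is one possible alternative, not the approach used above: it simulates a small data set (toy_movies, invented for illustration) and uses base R's predict() with interval = "confidence" in place of the effects package.

```r
library(ggplot2)

# Simulated data for illustration only; not the movies data
set.seed(1)
toy_movies <- data.frame(year = rep(1950:2020, each = 5))
toy_movies$averageRating <- 6 + 0.01 * (toy_movies$year - 1950) +
  rnorm(nrow(toy_movies))

toy_fit <- lm(averageRating ~ year, data = toy_movies)

# Fitted values with 95% confidence limits on a grid of years
grid <- data.frame(year = seq(1950, 2020, by = 10))
pred <- cbind(grid, predict(toy_fit, newdata = grid, interval = "confidence"))

# Columns fit, lwr, and upr come from predict()
effect_plot <- ggplot(pred, aes(x = year, y = fit, ymin = lwr, ymax = upr)) +
  geom_ribbon(alpha = 0.3) +
  geom_line()
effect_plot
```

With a sample as large as the one in this review, the band around the line would be very narrow, which is itself informative: the slope is estimated precisely even though it explains little of the variation.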