Logistic Regression

Binary Outcome – statistical model used for binary classification problems
Predicts probability of an event occurring (between 0 and 1)
Uses the logistic/sigmoid function to transform predictions

The Logistic Function

S-shaped curve (sigmoid)
Transforms any input into a probability between 0 and 1
Formula: \(p = \frac{1}{1 + e^{-z}}\), where \(z = \beta_0 + \beta_1 x\) is the linear predictor
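
The formula above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library. The function name `sigmoid` and the two-branch form (chosen so that large negative inputs do not overflow) are choices made here, not part of the course material.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number z to a probability strictly between 0 and 1."""
    if z >= 0:
        # For non-negative z, exp(-z) is at most 1, so this form is safe.
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as e^z / (1 + e^z) to avoid overflow in exp(-z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


# The curve is S-shaped: near 0 for very negative z, 0.5 at z = 0, near 1 for large z.
for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

The two branches compute the same quantity; the split only matters numerically for inputs far from zero.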

Assumptions

Independent observations
No multicollinearity among predictors
Linear relationship between the log odds of the target and the predictors
Adequate sample size
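
The linearity assumption can be made concrete with a small sketch: if the probability follows the logistic formula, then its log odds, \(\log\frac{p}{1-p}\), change by the same amount for every one-unit step in the predictor. The coefficients `beta_0` and `beta_1` below are made-up values for illustration only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function for moderate z (no overflow handling needed here)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p: float) -> float:
    """Log odds (logit) of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only to illustrate the idea.
beta_0, beta_1 = -1.0, 0.8

xs = [0, 1, 2, 3]
probabilities = [sigmoid(beta_0 + beta_1 * x) for x in xs]
logits = [log_odds(p) for p in probabilities]

# Differences between consecutive log odds: each one should equal beta_1.
steps = [later - earlier for earlier, later in zip(logits, logits[1:])]
print(probabilities)
print(steps)
```

The probabilities themselves do not rise by a constant amount, but the log odds do, which is what "linear in log odds" means.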

Data

Download Tagliamonte’s "that" expression data and inspect it

Phenomenon:

I think that I will have a great time
I think I will have a great time

Question: Which linguistic and social factors affect the expression of "that"?

Data wrangling

Make every variable in Tagliamonte’s "that" expression data a numeric variable
Call your Python script data_wrangling.py
Your script should read data/that-expression.csv and write data/clean-that-expression.csv
Submit it to Gradescope

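import numpy as np
import pandas as pd

# The cleaned file should pass checks like these (first value of each column is an integer)
data = pd.read_csv("data/clean-that-expression.csv")
assert isinstance(data.iloc[:,0][0], np.int64)
assert isinstance(data.iloc[:,1][0], np.int64)
assert isinstance(data.iloc[:,2][0], np.int64)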
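
One possible approach to the conversion is sketched below on a tiny made-up frame. The helper name `encode_as_int64` and the example values are hypothetical and are not taken from the real file; only the column names `expressed` and `know` appear in the course material. The sketch replaces each non-numeric column with integer codes from `pd.factorize` and casts everything to `int64`, which is the type the checks above look for.

```python
import numpy as np
import pandas as pd


def encode_as_int64(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df in which every column holds int64 values."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Already numeric: just cast (note this would truncate floats).
            out[col] = out[col].astype("int64")
        else:
            # Text categories: assign codes 0, 1, 2, ... in sorted order.
            # Missing values would receive the code -1.
            codes, _ = pd.factorize(out[col], sort=True)
            out[col] = codes.astype("int64")
    return out


# Made-up example values, for illustration only.
toy = pd.DataFrame(
    {
        "expressed": ["yes", "no", "yes"],
        "know": ["think", "know", "know"],
        "age": [25, 40, 31],
    }
)
clean = encode_as_int64(toy)
print(clean)
```

In a real script, the frame would come from `pd.read_csv("data/that-expression.csv")` and the result would be written with `to_csv("data/clean-that-expression.csv", index=False)`. Whether sorted integer codes are the right encoding for each variable is a modeling decision this sketch does not settle.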

Logistic Regression with statsmodels

Download clean-that-expression.csv

import pandas as pd
import statsmodels.api as sm


def main():
    # load the cleaned dataset
    data = pd.read_csv("data/clean-that-expression.csv")

    # define the independent (X) and dependent (y) variables
    X = data[["know"]]
    y = data["expressed"]

    # add an intercept column, build the model, and fit it
    log_reg = sm.Logit(y, sm.add_constant(X)).fit()
    print(log_reg.summary())


if __name__ == "__main__":
    main()