Logistic Regression

  • Statistical model for binary classification problems (the outcome has two possible values)
  • Predicts the probability of an event occurring (between 0 and 1)
  • Uses the logistic (sigmoid) function to map a linear predictor to that probability

The Logistic Function

  • S-shaped curve (sigmoid)
  • Transforms any input into a probability between 0 and 1
  • Formula: \(p = \frac{1}{1 + e^{-z}}\), where \(z\) is the linear predictor \(z = \beta_0 + \beta_1 x\)
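
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `sigmoid` and `predict_probability` and the coefficient values are made up for the example.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real z to a value between 0 and 1."""
    # Two algebraically equal branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(x: float, beta_0: float, beta_1: float) -> float:
    """p = sigmoid(z), with linear predictor z = beta_0 + beta_1 * x."""
    return sigmoid(beta_0 + beta_1 * x)


# Hypothetical coefficients, chosen only to show the S-shape.
for x in (-4, 0, 4):
    print(x, round(predict_probability(x, beta_0=0.0, beta_1=1.0), 3))
# prints:
# -4 0.018
# 0 0.5
# 4 0.982
```

Large negative inputs approach 0, large positive inputs approach 1, and z = 0 gives exactly 0.5.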

Assumptions

  • Independent observations
  • No multicollinearity among predictors
  • Linear relationship between the log odds of the outcome and the predictors
  • Adequate sample size
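
The linearity assumption can be made concrete with a small sketch: if p follows the logistic formula, then the log odds \(\log\frac{p}{1-p}\) equal the linear predictor, so each one-unit step in x changes the log odds by the same amount \(\beta_1\). The coefficient values and helper names below are hypothetical.

```python
import math


def probability(x: float, beta_0: float, beta_1: float) -> float:
    """Logistic model: p = 1 / (1 + exp(-(beta_0 + beta_1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(beta_0 + beta_1 * x)))


def log_odds(p: float) -> float:
    """Logit: log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, for illustration only.
beta_0, beta_1 = -1.0, 0.5

xs = [0, 1, 2, 3]
logits = [log_odds(probability(x, beta_0, beta_1)) for x in xs]

# Differences between consecutive log odds: each should be close to beta_1.
steps = [b - a for a, b in zip(logits, logits[1:])]
print([round(s, 6) for s in steps])
```

The probabilities themselves do not change by a constant amount per step; only the log odds do.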

Data

Download Tagliamonte’s that expression data and inspect it

Phenomenon:

  • I think that I will have a great time
  • I think I will have a great time

Question: Which linguistic and social factors affect the expression of that?

Data wrangling

  • Make every variable in Tagliamonte’s that expression data a numeric variable
  • Call your Python script data_wrangling.py
  • Your script should read data/that-expression.csv and write data/clean-that-expression.csv
  • Submit it to Gradescope

import numpy as np
import pandas as pd
data = pd.read_csv("data/clean-that-expression.csv")
assert isinstance(data.iloc[0, 0], np.int64)
assert isinstance(data.iloc[0, 1], np.int64)
assert isinstance(data.iloc[0, 2], np.int64)
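
One possible way to turn text columns into integers is sketched below. It is only an illustration under assumptions: the helper name `encode_as_integers`, the toy column `age`, and the values `yes`/`no` are made up, and the real dataset may need a different coding (for example, a deliberate choice of which level becomes 0).

```python
import pandas as pd


def encode_as_integers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where every non-numeric column holds integer codes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # sort=True assigns codes in sorted order of the labels, so the
            # mapping is reproducible. Missing values would be coded as -1.
            codes, labels = pd.factorize(out[col], sort=True)
            out[col] = codes.astype("int64")
    return out


# Toy data with made-up values (not the real dataset).
toy = pd.DataFrame({"expressed": ["yes", "no", "yes"], "age": [25, 40, 31]})
clean = encode_as_integers(toy)
print(clean)
print(clean.dtypes)
```

In data_wrangling.py, the same idea would be wrapped between `pd.read_csv("data/that-expression.csv")` and `clean.to_csv("data/clean-that-expression.csv", index=False)`; passing `index=False` keeps the row index from being written as an extra first column.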

Logistic Regression with statsmodels

Download clean-that-expression.csv

import pandas as pd
import statsmodels.api as sm


def main():
    # Load the cleaned dataset
    data = pd.read_csv("data/clean-that-expression.csv")

    # Define the independent (X) and dependent (y) variables
    X = data[["know"]]
    y = data["expressed"]

    # Add an intercept column, then build and fit the model
    log_reg = sm.Logit(y, sm.add_constant(X)).fit()
    print(log_reg.summary())


if __name__ == "__main__":
    main()