OLS vs. Logit

Comparing model results

Data: clean_titanic.csv

  • Create two Python scripts: one that fits an OLS model, the other a Logit model.
  • What are the differences in the results?
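
One difference to look for is the range of the fitted values. The sketch below is a minimal illustration in plain Python, not the Titanic exercise itself: the ten data points are made up, and fit_ols and fit_logit are hypothetical helpers written for this example (closed-form least squares and a basic Newton-Raphson loop) rather than statsmodels calls.

```python
import math


def fit_ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sigmoid(z):
    """Inverse logit: map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(x, y, iterations=25):
    """Newton-Raphson for a one-predictor logistic regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian (2x2, symmetric)
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system with the explicit inverse
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up toy data: one predictor, binary response (not from the Titanic file)
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_ols(x, y)
b0, b1 = fit_logit(x, y)
ols_pred = [a + b * xi for xi in x]
logit_pred = [sigmoid(b0 + b1 * xi) for xi in x]

for xi, o, l in zip(x, ols_pred, logit_pred):
    print(f"x={xi}  OLS={o:+.3f}  Logit={l:.3f}")
```

On this toy data the OLS fitted values dip just below 0 at the low end and rise just above 1 at the high end, while the Logit fitted values always stay strictly between 0 and 1.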

OLS

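import pandas as pd
import numpy as np
import statsmodels.api as sm


def logit_to_probability(logit):
    """Inverse logit (sigmoid): map a log-odds value to a probability."""
    return 1 / (1 + np.exp(-logit))


def main():
    data = pd.read_csv("data/clean_titanic.csv")

    # Predictors; "sex" is assumed to be numerically encoded in the cleaned file.
    X = data[["fare", "sex", "age",
              "siblings_spouses_aboard",
              "parents_children_aboard"]]
    y = data["survived"]  # response variable

    # OLS on a 0/1 response is a linear probability model.
    model = sm.OLS(y, sm.add_constant(X)).fit()
    print(model.summary())

    coeffs = pd.DataFrame(model.params)
    print(coeffs)
    # OLS coefficients are already on the probability scale, so this
    # transform only mirrors the Logit script's output for comparison.
    coeffs_prob = coeffs[0].apply(logit_to_probability)
    print(coeffs_prob)


if __name__ == "__main__":
    main()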

Logit

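import pandas as pd
import numpy as np
import statsmodels.api as sm


def logit_to_probability(logit):
    """Inverse logit (sigmoid): map a log-odds value to a probability."""
    return 1 / (1 + np.exp(-logit))


def main():
    data = pd.read_csv("data/clean_titanic.csv")

    # Predictors; "sex" is assumed to be numerically encoded in the cleaned file.
    X = data[["fare", "sex", "age",
              "siblings_spouses_aboard",
              "parents_children_aboard"]]
    y = data["survived"]  # response variable

    # Logit models the log-odds of survival as a linear function of X.
    model = sm.Logit(y, sm.add_constant(X)).fit()
    print(model.summary())

    coeffs = pd.DataFrame(model.params)  # coefficients are in log-odds units
    print(coeffs)
    # Only the transformed intercept is a probability (the baseline when all
    # predictors are 0); a transformed slope is not a change in probability.
    coeffs_prob = coeffs[0].apply(logit_to_probability)
    print(coeffs_prob)


if __name__ == "__main__":
    main()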
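
To read the Logit coefficients, it may help to work on the full linear predictor rather than on one coefficient at a time. The sketch below is only an illustration: the values in params and passenger are invented, not estimates from clean_titanic.csv, and odds_ratio and predict_probability are hypothetical helper names.

```python
import math


def inverse_logit(z):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in a predictor."""
    return math.exp(coef)


def predict_probability(params, row):
    """Sum intercept and coefficient * value terms, then apply the inverse logit."""
    z = params["const"] + sum(params[name] * value for name, value in row.items())
    return inverse_logit(z)


# Invented coefficients for illustration only (not fitted to any data)
params = {"const": -1.0, "fare": 0.01, "age": -0.02}
# Invented passenger
passenger = {"fare": 50.0, "age": 30.0}

print("odds ratio per unit of fare:", odds_ratio(params["fare"]))
print("predicted probability:", predict_probability(params, passenger))
```

The inverse logit is applied once, to the whole sum, so the effect of one predictor on the probability depends on the values of the others.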

Spam vs. Ham

Run both OLS and Logit on this email data

  • Download the data
  • Clean the data
  • What’s the target/response variable? What are the features/predictors?
  • Model the data
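
The target and feature steps above could be sketched as follows. The email data set is not specified here, so this assumes each record has a text label ("spam" or "ham") and a raw message body. The three features and the helper names encode_label and email_features are hypothetical choices for illustration, not a recommended feature set.

```python
def encode_label(label):
    """Map the text label to the binary target: spam -> 1, ham -> 0."""
    mapping = {"spam": 1, "ham": 0}
    return mapping[label.strip().lower()]


def email_features(text):
    """Derive a few simple numeric predictors from a raw message body."""
    words = text.split()
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(1 for ch in letters if ch.isupper())
    return {
        "n_words": len(words),
        "n_exclamations": text.count("!"),
        # Share of letters that are uppercase; 0.0 for messages with no letters
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }


# Made-up message, for illustration only
print(encode_label("spam"), email_features("WIN cash NOW!!!"))
```

The encoded label plays the role of y and the feature columns the role of X, in the same way as in the Titanic scripts.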