Intro to Data Modeling

Data Modeling

Inference vs. Prediction

  • Inference:
    • understanding relationships and testing hypotheses about how variables interact
    • explain the underlying mechanisms and relationships in the data
    • How does education relate to income?

Inference vs. Prediction

  • Prediction:
    • generating accurate estimates of future or unknown values
    • optimizing the model’s ability to make correct predictions
    • Accurately forecast someone’s income based on their education level.

Inference vs. Prediction

  • Inference models prioritize interpretability and often use simpler, more transparent methods like linear regression
  • Prediction models may use more complex “black box” approaches like neural networks if they improve accuracy

Inference vs. Prediction

  • In inference, you carefully choose variables based on theory and prior research
  • For prediction, you might include any feature that improves predictive performance, even if the relationship isn’t theoretically clear (be careful with this approach)

Inference vs. Prediction

  • Inference focuses on metrics like p-values, confidence intervals, and effect sizes
  • Prediction emphasizes metrics like mean squared error, classification accuracy, or the area under the ROC curve (AUC)
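
As a concrete illustration, here is a minimal sketch of computing these prediction metrics. The scikit-learn import and the toy arrays are assumptions for demonstration only:

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score

# Toy values, invented purely for illustration
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9])  # predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)          # hard class labels

print(accuracy_score(y_true, y_pred))      # classification accuracy
print(roc_auc_score(y_true, y_prob))       # area under the ROC curve (AUC)
print(mean_squared_error(y_true, y_prob))  # MSE (most natural for regression targets)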

Inference vs. Prediction

Overfitting concerns

  • both approaches need to address overfitting
  • it’s especially critical in prediction
  • inference models might accept slightly worse fit for better interpretability
  • prediction models focus heavily on cross-validation and out-of-sample performance
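
To make the last point concrete, here is a minimal sketch of checking out-of-sample performance with a single hold-out split, a simplified stand-in for full cross-validation. It reuses the Titanic data introduced later in this section, and the scikit-learn split utility is an assumption:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/clean_titanic.csv")
X = sm.add_constant(data[["pclass", "sex"]])
y = data["survived"]

# Hold out 20% of the rows for out-of-sample evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = sm.OLS(y_train, X_train).fit()
preds = model.predict(X_test)

# A test error far above the training error is a classic sign of overfitting
print("test MSE:", np.mean((y_test - preds) ** 2))
print("train MSE:", np.mean(model.resid ** 2))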

Inference

To interpret statistical models, we will use the statsmodels Python package.

Ordinary Least Squares (OLS) regression

OLS is a method that finds the best-fitting straight line through a set of points by minimizing the sum of squared vertical distances between the data points and the line.

The goal is to find a line y = β₀ + β₁x that best fits your data, where:

  • β₀ is the y-intercept
  • β₁ is the slope
  • x is your independent variable (the feature or features)
  • y is your dependent variable (the target)

Ordinary Least Squares (OLS) regression

For each data point, OLS:

  • Calculates the vertical distance (residual) between the actual y-value and the predicted y-value
  • Squares these distances (to make negatives positive and penalize larger errors more)
  • Sums all squared distances
  • Finds the line that minimizes this sum
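
To make these steps concrete, here is a minimal NumPy sketch (the data points are invented for illustration) that computes the slope and intercept from the closed-form formulas and then the sum of squared residuals that OLS minimizes:

import numpy as np

# Toy data, purely illustrative
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)  # vertical distances (step 1)
ssr = np.sum(residuals ** 2)         # squared and summed (steps 2-3)
print(beta0, beta1, ssr)             # beta0 and beta1 minimize ssr (step 4)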

Ordinary Least Squares (OLS) regression

Key Assumptions:

  • Linear relationship between variables
  • Independent observations
  • Homoscedasticity (constant variance in errors)
  • Normally distributed errors
  • No perfect multicollinearity
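
statsmodels includes diagnostics for some of these assumptions. Here is a minimal sketch checking homoscedasticity and multicollinearity on the Titanic model used below; the thresholds in the comments are common rules of thumb, not hard cutoffs:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("data/clean_titanic.csv")
X = sm.add_constant(data[["pclass", "sex"]])
result = sm.OLS(data["survived"], X).fit()

# Homoscedasticity: a small Breusch-Pagan p-value suggests non-constant variance
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(result.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# Multicollinearity: VIF values well above ~5-10 are a warning sign
for i, col in enumerate(X.columns):
    if col == "const":
        continue  # the intercept column's VIF is not meaningful
    print(col, variance_inflation_factor(X.values, i))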

Ordinary Least Squares (OLS) regression

Advantages:

  • Simple to understand and implement
  • Best Linear Unbiased Estimator (BLUE) under certain conditions
  • Computationally efficient
  • Clear interpretation of results

Ordinary Least Squares (OLS) regression

Limitations:

  • Sensitive to outliers
  • Assumes linear relationships
  • May not capture complex patterns
  • Predictors must not be strongly correlated with one another (multicollinearity)

OLS regression in statsmodels

import pandas as pd
import statsmodels.api as sm


def main():
    # Load the cleaned Titanic data set
    data = pd.read_csv("data/clean_titanic.csv")
    X = data[["pclass", "sex"]]  # features (sex is numerically encoded in the cleaned file)
    y = data["survived"]         # target: 1 = survived, 0 = did not survive

    # Ordinary Least Squares regression; add_constant() adds the
    # intercept term (the "const" row in the summary output)
    result = sm.OLS(y, sm.add_constant(X)).fit()
    print(result.summary())


main()

OLS regression results

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.366
Model:                            OLS   Adj. R-squared:                  0.365
Method:                 Least Squares   F-statistic:                     255.2
Date:                Tue, 11 Feb 2025   Prob (F-statistic):           3.19e-88
Time:                        14:23:34   Log-Likelihood:                -417.77
No. Observations:                 887   AIC:                             841.5
Df Residuals:                     884   BIC:                             855.9
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================

OLS regression results

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0825      0.040     26.795      0.000       1.003       1.162
pclass        -0.1577      0.016    -10.029      0.000      -0.189      -0.127
sex           -0.5161      0.027    -18.776      0.000      -0.570      -0.462
==============================================================================
Omnibus:                       40.150   Durbin-Watson:                   1.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.755
Skew:                           0.549   Prob(JB):                     1.91e-10
Kurtosis:                       3.060   Cond. No.                         9.07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

OLS regression results

  • What other variables can you add to the model?
  • How does the model change?
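
For the first question, here is a minimal sketch of extending the model. The extra columns "age" and "fare" are assumptions about what the cleaned file contains; adjust them to the columns actually present:

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("data/clean_titanic.csv")
X = data[["pclass", "sex", "age", "fare"]]  # "age" and "fare" are assumed to exist
y = data["survived"]

result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())  # compare R-squared, coefficients, and p-values with the smaller model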

OLS regression results

Run an OLS regression on this Kaggle dataset on house prices.
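
A minimal starter sketch for the exercise follows. The file name and column names are hypothetical placeholders; replace them with the actual names in the dataset you download:

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("data/house_prices.csv")  # hypothetical file name
X = data[["sqft_living", "bedrooms"]]        # hypothetical numeric features
y = data["price"]                            # hypothetical target column

result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())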