Intro to Data Modeling

Data Modeling

Inference vs. Prediction

  • Inference:
    • understanding relationships and testing hypotheses about how variables interact
    • explain the underlying mechanisms and relationships in the data
    • How does education relate to income?

Inference vs. Prediction

  • Prediction:
    • generating accurate estimates of future or unknown values
    • optimizing the model’s ability to make correct predictions
    • Accurately forecast someone’s income based on their education level.

Inference vs. Prediction

  • Inference models prioritize interpretability and often use simpler, more transparent methods like linear regression
  • Prediction models may use more complex “black box” approaches like neural networks if they improve accuracy

Inference vs. Prediction

  • In inference, you carefully choose variables based on theory and prior research
  • For prediction, you might include any feature that improves predictive performance, even if the relationship isn’t theoretically clear (be careful with this approach)

Inference vs. Prediction

  • Inference focuses on metrics like p-values, confidence intervals, and effect sizes
  • Prediction emphasizes metrics like mean squared error, classification accuracy, or the area under the ROC curve (AUC)
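
As a concrete illustration, here is a minimal sketch of computing these prediction metrics. The scikit-learn import and the toy arrays are assumptions for demonstration only:

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score

# Toy values, invented purely for illustration
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9])  # predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)          # hard class labels

print(accuracy_score(y_true, y_pred))      # classification accuracy
print(roc_auc_score(y_true, y_prob))       # area under the ROC curve (AUC)
print(mean_squared_error(y_true, y_prob))  # MSE (most natural for regression targets)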

Inference vs. Prediction

Overfitting concerns

  • both approaches need to address overfitting
  • it’s especially critical in prediction
  • inference models might accept slightly worse fit for better interpretability
  • prediction models focus heavily on cross-validation and out-of-sample performance
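
To make the last point concrete, here is a minimal sketch of checking out-of-sample performance with a single hold-out split, a simplified stand-in for full cross-validation. It reuses the Titanic data introduced later in this section, and the scikit-learn split utility is an assumption:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/clean_titanic.csv")
X = sm.add_constant(data[["pclass", "sex"]])
y = data["survived"]

# Hold out 20% of the rows for out-of-sample evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = sm.OLS(y_train, X_train).fit()
preds = model.predict(X_test)

# A test error far above the training error is a classic sign of overfitting
print("test MSE:", np.mean((y_test - preds) ** 2))
print("train MSE:", np.mean(model.resid ** 2))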

Inference

To interpret statistical models, we will use the statsmodels Python package.

Ordinary Least Squares (OLS) regression

OLS is a method that finds the best-fitting straight line through a set of points by minimizing the sum of squared vertical distances between the data points and the line.

The goal is to find a line y = β₀ + β₁x that best fits your data, where:

  • β₀ is the y-intercept
  • β₁ is the slope
  • x is your independent variable (the feature or features)
  • y is your dependent variable (the target)

Ordinary Least Squares (OLS) regression

For each data point, OLS:

  • Calculates the vertical distance (residual) between the actual y-value and the predicted y-value
  • Squares these distances (to make negatives positive and penalize larger errors more)
  • Sums all squared distances
  • Finds the line that minimizes this sum
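
To make these steps concrete, here is a minimal NumPy sketch (the data points are invented for illustration) that computes the slope and intercept from the closed-form formulas and then the sum of squared residuals that OLS minimizes:

import numpy as np

# Toy data, purely illustrative
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

residuals = y - (beta0 + beta1 * x)  # vertical distances (step 1)
ssr = np.sum(residuals ** 2)         # squared and summed (steps 2-3)
print(beta0, beta1, ssr)             # beta0 and beta1 minimize ssr (step 4)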

Ordinary Least Squares (OLS) regression

Key Assumptions:

  • Linear relationship between variables
  • Independent observations
  • Homoscedasticity (constant variance in errors)
  • Normally distributed errors
  • No perfect multicollinearity
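
statsmodels includes diagnostics for some of these assumptions. Here is a minimal sketch checking homoscedasticity and multicollinearity on the Titanic model used below; the thresholds in the comments are common rules of thumb, not hard cutoffs:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("data/clean_titanic.csv")
X = sm.add_constant(data[["pclass", "sex"]])
result = sm.OLS(data["survived"], X).fit()

# Homoscedasticity: a small Breusch-Pagan p-value suggests non-constant variance
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(result.resid, X)
print("Breusch-Pagan p-value:", lm_pvalue)

# Multicollinearity: VIF values well above ~5-10 are a warning sign
for i, col in enumerate(X.columns):
    if col == "const":
        continue  # the intercept column's VIF is not meaningful
    print(col, variance_inflation_factor(X.values, i))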

Ordinary Least Squares (OLS) regression

Advantages:

  • Simple to understand and implement
  • Best Linear Unbiased Estimator (BLUE) under certain conditions
  • Computationally efficient
  • Clear interpretation of results

Ordinary Least Squares (OLS) regression

Limitations:

  • Sensitive to outliers
  • Assumes linear relationships
  • May not capture complex patterns
  • Predictors must not be strongly correlated with one another (multicollinearity)

OLS regression in statsmodels

import pandas as pd
import statsmodels.api as sm


def main():
    # Load the cleaned Titanic data set
    data = pd.read_csv("data/clean_titanic.csv")
    X = data[["pclass", "sex"]]  # features (sex is numerically encoded in the cleaned file)
    y = data["survived"]         # target: 1 = survived, 0 = did not survive

    # Ordinary Least Squares regression; add_constant() adds the
    # intercept term (the "const" row in the summary output)
    result = sm.OLS(y, sm.add_constant(X)).fit()
    print(result.summary())


main()

OLS regression results

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               survived   R-squared:                       0.366
Model:                            OLS   Adj. R-squared:                  0.365
Method:                 Least Squares   F-statistic:                     255.2
Date:                Tue, 11 Feb 2025   Prob (F-statistic):           3.19e-88
Time:                        14:23:34   Log-Likelihood:                -417.77
No. Observations:                 887   AIC:                             841.5
Df Residuals:                     884   BIC:                             855.9
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================

OLS regression results

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0825      0.040     26.795      0.000       1.003       1.162
pclass        -0.1577      0.016    -10.029      0.000      -0.189      -0.127
sex           -0.5161      0.027    -18.776      0.000      -0.570      -0.462
==============================================================================
Omnibus:                       40.150   Durbin-Watson:                   1.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.755
Skew:                           0.549   Prob(JB):                     1.91e-10
Kurtosis:                       3.060   Cond. No.                         9.07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

OLS regression results

  • What other variables can you add to the model?
  • How does the model change?
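
For the first question, here is a minimal sketch of extending the model. The extra columns "age" and "fare" are assumptions about what the cleaned file contains; adjust them to the columns actually present:

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("data/clean_titanic.csv")
X = data[["pclass", "sex", "age", "fare"]]  # "age" and "fare" are assumed to exist
y = data["survived"]

result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())  # compare R-squared, coefficients, and p-values with the smaller model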

OLS regression results

Run an OLS regression on this Kaggle dataset on house prices.
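
A minimal starter sketch for the exercise follows. The file name and column names are hypothetical placeholders; replace them with the actual names in the dataset you download:

import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("data/house_prices.csv")  # hypothetical file name
X = data[["sqft_living", "bedrooms"]]        # hypothetical numeric features
y = data["price"]                            # hypothetical target column

result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())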