Overfitting concerns
For interpreting statistical models, we will be using the statsmodels Python package.
Ordinary least squares (OLS) is a method that finds the best-fitting straight line through a set of points by minimizing the sum of squared vertical distances between the data points and the line.
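The goal is to find a line y = β₀ + β₁x that best fits your data, where β₀ is the intercept (the value of y when x = 0) and β₁ is the slope (the change in y per unit change in x).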
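For each data point, OLS takes the vertical distance between the observed y and the line (the residual) and squares it; the fitted line is the one that makes the sum of these squared residuals as small as possible.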
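To make the per-point computation concrete, here is a minimal sketch in plain Python that fits y = β₀ + β₁x with the standard closed-form solution for a single predictor. The data points and the helper names (`fit_line`, `sum_squared_residuals`) are made up for illustration; they are not part of statsmodels.

```python
# Minimal sketch: simple OLS for y = b0 + b1*x using the closed-form solution.
# The data points below are invented for illustration only.

def fit_line(xs, ys):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_mean - b1 * x_mean
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum over all points of (observed y - predicted y) squared."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)
# A line with a slightly different slope has a larger sum of squares.
worse = sum_squared_residuals(xs, ys, b0, b1 + 0.1)
print(f"b0={b0:.3f}, b1={b1:.3f}, SSR={best:.4f}, perturbed SSR={worse:.4f}")
```

Perturbing the slope and seeing the sum of squared residuals grow is a quick sanity check that the fitted line really is the minimizer.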
Key Assumptions:
Advantages:
Limitations:
                            OLS Regression Results
==============================================================================
Dep. Variable:               survived   R-squared:                       0.366
Model:                            OLS   Adj. R-squared:                  0.365
Method:                 Least Squares   F-statistic:                     255.2
Date:                Tue, 11 Feb 2025   Prob (F-statistic):           3.19e-88
Time:                        14:23:34   Log-Likelihood:                -417.77
No. Observations:                 887   AIC:                             841.5
Df Residuals:                     884   BIC:                             855.9
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0825      0.040     26.795      0.000       1.003       1.162
pclass        -0.1577      0.016    -10.029      0.000      -0.189      -0.127
sex           -0.5161      0.027    -18.776      0.000      -0.570      -0.462
==============================================================================
Omnibus:                       40.150   Durbin-Watson:                   1.921
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.755
Skew:                           0.549   Prob(JB):                     1.91e-10
Kurtosis:                       3.060   Cond. No.                         9.07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Run an OLS regression on this Kaggle dataset of house prices.
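
One possible starting point for the exercise, sketched with the statsmodels formula interface. The column names (`price`, `sqft`, `bedrooms`) and the helper `fit_price_model` are hypothetical, and the DataFrame is a synthetic stand-in so that the sketch runs on its own; with the real data you would load the downloaded file with `pd.read_csv` and use the column names it actually contains.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def fit_price_model(df, target, features):
    """Fit target ~ feature_1 + feature_2 + ... by OLS and return the results."""
    formula = f"{target} ~ " + " + ".join(features)
    # The formula interface adds the intercept automatically.
    return smf.ols(formula, data=df).fit()


# Synthetic stand-in data (hypothetical columns); replace with the real dataset.
rng = np.random.default_rng(1)
n = 200
houses = pd.DataFrame({
    "sqft": rng.uniform(500, 3000, size=n),
    "bedrooms": rng.integers(1, 6, size=n),
})
houses["price"] = (
    50_000
    + 150 * houses["sqft"]
    + 10_000 * houses["bedrooms"]
    + rng.normal(scale=5_000, size=n)
)

results = fit_price_model(houses, "price", ["sqft", "bedrooms"])
print(results.summary())
```

With the formula interface the intercept row is labelled `Intercept` rather than `const`.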