OLS Case Study

House Prices

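import pandas as pd
import statsmodels.api as sm

def main():
    # Load the cleaned house-price data.
    data = pd.read_csv("data/clean_house_prices.csv")

    # Predictors: number of bedrooms and house size; target: price.
    X = data[["bed", "house_size"]]
    y = data["price"]

    # add_constant adds the intercept column ("const" in the summary).
    result = sm.OLS(y, sm.add_constant(X)).fit()
    print(result.summary())


main()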

Results – part 1

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.242
Model:                            OLS   Adj. R-squared:                  0.242
Method:                 Least Squares   F-statistic:                     1623.
Date:                Sun, 16 Feb 2025   Prob (F-statistic):               0.00
Time:                        09:06:45   Log-Likelihood:            -1.4998e+05
No. Observations:               10154   AIC:                         3.000e+05
Df Residuals:                   10151   BIC:                         3.000e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================

Results – part 2

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       -2.26e+04   1.92e+04     -1.177      0.239   -6.02e+04     1.5e+04
bed        -2.675e+04   7072.066     -3.783      0.000   -4.06e+04   -1.29e+04
house_size   391.1845      8.311     47.067      0.000     374.893     407.476
==============================================================================
Omnibus:                    12672.728   Durbin-Watson:                   1.202
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          3902146.529
Skew:                           6.543   Prob(JB):                         0.00
Kurtosis:                      98.141   Cond. No.                     6.85e+03
==============================================================================

Interpretation

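Holding the number of bedrooms fixed, each additional square foot of house size is associated with an average price increase of about 391 dollars (95% confidence interval: 374.89 to 407.48).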
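
The reading of the coefficients can be checked by hand. The sketch below plugs the rounded coefficients from the table into the fitted equation, price = const + coef_bed * bed + coef_house_size * house_size. The example house (3 bedrooms, 2000 square feet) is made up for illustration, and because the coefficients are rounded as printed, the results are approximate.

```python
# Coefficients copied from the regression table (rounded as printed there).
CONST = -2.26e4
COEF_BED = -2.675e4
COEF_HOUSE_SIZE = 391.1845


def predicted_price(bed, house_size):
    """Fitted price in dollars for a given bedroom count and square footage."""
    return CONST + COEF_BED * bed + COEF_HOUSE_SIZE * house_size


# Hypothetical house used only to illustrate the arithmetic.
base = predicted_price(bed=3, house_size=2000)
one_more_sqft = predicted_price(bed=3, house_size=2001)
one_more_bed = predicted_price(bed=4, house_size=2000)

print(f"3 bed, 2000 sq ft: {base:,.0f}")
print(f"Effect of +1 sq ft (same bedrooms): {one_more_sqft - base:+,.2f}")
print(f"Effect of +1 bedroom (same size):   {one_more_bed - base:+,.0f}")
```

Changing one input while keeping the other fixed moves the prediction by exactly that input's coefficient, which is what "holding the other variable fixed" means here.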

Predicting new data

import pandas as pd
import statsmodels.api as sm

def main():
    data = pd.read_csv("data/clean_house_prices.csv")

    # zip_code_99350 is assumed to be a 0/1 indicator for that zip code.
    predictors = ["bed", "house_size", "zip_code_99350"]
    X = data[predictors]
    y = data["price"]

    result = sm.OLS(y, sm.add_constant(X)).fit()
    print(result.summary())

    # New houses to price; the columns must match the training predictors.
    new_houses = pd.DataFrame({"bed": [2, 3, 2],
                               "house_size": [1400, 2300, 1400],
                               "zip_code_99350": [0, 0, 1]})
    # has_constant="add" forces the intercept column even if a column in
    # the new data happens to be constant (for example, a single row).
    new_data = sm.add_constant(new_houses[predictors], has_constant="add")

    predictions = result.predict(new_data)
    print(predictions)


main()

Selecting some columns from data

selected_data = filtered_data.loc[:, ["price", "bed", "house_size", "zip_code"]]
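
The prediction model uses a column named zip_code_99350, while the selection above keeps a zip_code column. One plausible way to get from one to the other is one-hot encoding with pd.get_dummies; this is an assumption about how the column was produced, not something the slides state. The small frame below is made up for illustration.

```python
import pandas as pd

# Toy data; the zip codes and prices are invented for illustration.
selected_data = pd.DataFrame({
    "price": [250_000, 410_000, 199_000],
    "bed": [2, 3, 2],
    "house_size": [1400, 2300, 1400],
    "zip_code": [99350, 99301, 99350],
})

# Replace zip_code with one 0/1 indicator column per distinct zip code.
# Column names are built as "<original column>_<value>".
encoded = pd.get_dummies(selected_data, columns=["zip_code"], dtype=int)

print(encoded.columns.tolist())
print(encoded["zip_code_99350"].tolist())
```

If every indicator column is used in a model that also has an intercept, the indicators sum to the constant column; using only some of them, as the prediction example does with a single zip code, avoids that.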

Removing columns from data

columns_to_exclude = ["price"]
selected_data = data.drop(columns=columns_to_exclude)
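
Dropping the target column is a convenient way to build the predictor matrix when every remaining column should enter the model. A minimal sketch on a made-up frame, assuming the same column names as the examples above:

```python
import pandas as pd

# Toy data with invented values, using the column names from the case study.
data = pd.DataFrame({
    "price": [250_000, 410_000, 199_000],
    "bed": [2, 3, 2],
    "house_size": [1400, 2300, 1400],
})

columns_to_exclude = ["price"]

# drop returns a new DataFrame; the original data is left unchanged.
X = data.drop(columns=columns_to_exclude)
y = data["price"]

print(X.columns.tolist())
print(data.columns.tolist())
```

Because drop does not modify data in place, the price column is still available afterwards to serve as y.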