Project 1

In this project, you will fit a linear regression model to data on tuition in the US.

Download the tuition data and set up your coding environment.

Data wrangling

Create a data_wrangling.py Python script and ensure that:

  1. the year variable is an integer – remove the dash (-) and everything after it
  2. the state variable is transformed into one-hot encoded dummy variables, one per state, stored as integers (0 and 1)
  3. your main() function reads the original file from a folder called data (use pd.read_csv("data/tuition.csv")) and writes a clean-tuition.csv file to the same folder (use .to_csv("data/clean-tuition.csv"))
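
The three requirements above could be sketched roughly as follows. This is only an illustration under assumptions: it supposes the raw file has columns literally named year (holding strings such as "2004-05") and state, which may not match the actual tuition.csv. Naming each dummy column after its state and passing index=False are also choices made here, not requirements stated in this handout.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the raw tuition data.

    Assumes (hypothetically) columns named "year" and "state".
    """
    out = df.copy()

    # Keep only the part before the dash, e.g. "2004-05" -> 2004.
    out["year"] = out["year"].astype(str).str.split("-").str[0].astype(int)

    # One 0/1 integer column per state, named after the state itself.
    dummies = pd.get_dummies(out["state"], dtype=int)
    out = pd.concat([out.drop(columns="state"), dummies], axis=1)
    return out


def main() -> None:
    df = pd.read_csv("data/tuition.csv")
    # index=False is an assumption: it avoids writing an extra index column.
    clean(df).to_csv("data/clean-tuition.csv", index=False)
```

In the actual script, main() would typically be called under the usual if __name__ == "__main__": guard. Keeping the transformation in a separate clean() function makes it easy to check on a small hand-made DataFrame before running it on the real file.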

Here’s what your clean-tuition.csv file should look like (showing first rows and first columns only):

Plot

Create a plots.ipynb Python notebook and, using the clean-tuition.csv data, create a scatterplot of tuition vs. academic year.
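
One possible starting point for the notebook cell is sketched below using matplotlib. The column names year and tuition, the axis labels, and the helper name plot_tuition are assumptions for illustration; the styling of the target figure is not specified here, so adjust it to match.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_tuition(df: pd.DataFrame):
    """Scatter tuition against academic year and return the Axes.

    Assumes (hypothetically) columns named "year" and "tuition".
    """
    fig, ax = plt.subplots(figsize=(8, 5))
    # Semi-transparent points make overlapping observations easier to see.
    ax.scatter(df["year"], df["tuition"], alpha=0.5)
    ax.set_xlabel("Academic year")
    ax.set_ylabel("Tuition")
    ax.set_title("Tuition vs. academic year")
    return ax
```

In the notebook this might be called as plot_tuition(pd.read_csv("data/clean-tuition.csv")).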

You should replicate this scatterplot:

Data modeling

Create a modeling.py Python script and run Ordinary Least Squares on the clean-tuition.csv data.

Ensure that your main() function:

  1. prints only the model summary, the R-squared value and the adjusted R-squared value, in this order (your script should not print anything else) – you can use print(model.summary()), print(model.rsquared) and print(model.rsquared_adj) (replace model with the variable name you use in your script)
  2. writes a coefficients.csv file with all of your model’s coefficients – you can base your code on this block (again, replace model with your variable name)
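import pandas as pd

# model.params is a pandas Series indexed by variable name
coeffs = model.params
df_coeffs = pd.DataFrame(coeffs)
df_coeffs.columns = ["coefficient"]
# Copy the variable names into a column, since the index is not written out
df_coeffs["variable"] = df_coeffs.index
df_coeffs.to_csv("coefficients.csv", index=False)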

Here’s what your coefficients.csv file should look like (showing first rows only):

Submit to gradescope

You are to submit three files to gradescope:

  • data_wrangling.py
  • plots.ipynb
  • modeling.py

You can submit other .py files if you are using any. Don’t include any subfolders in your submission, though: gradescope will not consider any folders when running the code (only the data folder, where .csv files should be placed).

The data gradescope will be using is slightly different from the data you are being provided with, but your code should be generalizable (meaning, no hardcoding of model coefficients, for example).

Click here to open gradescope assignment to submit your files