import pandas as pd

# `model` is the fitted regression results object from your script
coeffs = model.params
df_coeffs = pd.DataFrame(coeffs)
df_coeffs.columns = ["coefficient"]
df_coeffs["variable"] = df_coeffs.index
df_coeffs.to_csv("coefficients.csv", index=False)
Project 1
In this project, you will fit a linear regression model to data on tuition in the US.
Download the tuition data and set up your coding environment.
Data wrangling
Create a data_wrangling.py Python script and ensure that:
- the year variable is an integer (remove the - and everything after it)
- the state variable is transformed into dummy variables, one per state, one-hot encoded (with zeros and ones, in integer format)
- your main() function reads the original file from a folder called data (use pd.read_csv("data/tuition.csv")) and writes out a clean-tuition.csv file to the data folder (use .to_csv("data/clean-tuition.csv"))
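The wrangling steps above could be sketched roughly as follows. This is a minimal sketch, not a reference solution: the column names year, state and tuition, the "2004-05" year format, and the helper name clean_tuition are all assumptions, so check them against the actual file. The state_ prefix that pd.get_dummies adds by default may also differ from the column names expected in the clean file.

```python
import pandas as pd


def clean_tuition(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: integer year plus one 0/1 column per state."""
    df = df.copy()
    # Keep only the part before the dash, e.g. "2004-05" -> 2004 (assumed format).
    df["year"] = df["year"].astype(str).str.split("-").str[0].astype(int)
    # One-hot encode the assumed "state" column as integer 0/1 columns.
    df = pd.get_dummies(df, columns=["state"], dtype=int)
    return df


def main() -> None:
    # Paths are the ones given in the assignment text.
    df = pd.read_csv("data/tuition.csv")
    # Whether to also pass index=False depends on the expected output file.
    clean_tuition(df).to_csv("data/clean-tuition.csv")
```

In the script itself, main() would typically be called under an `if __name__ == "__main__":` guard.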
Here’s what your clean-tuition.csv file should look like (showing first rows and first columns only):
Plot
Create a plots.ipynb Python notebook and, using the clean-tuition.csv data, create a scatterplot of tuition vs. academic year.
You should replicate this scatterplot:
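As a starting point for the notebook, a scatterplot of this kind could be drawn along these lines. This is only a sketch under assumptions: the assignment does not say which plotting library to use (matplotlib is assumed here), the column names year and tuition are guesses, and the labels, title and styling should be adjusted to match the reference figure.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_tuition(df: pd.DataFrame):
    """Scatter tuition against academic year and return the Axes."""
    fig, ax = plt.subplots()
    # Column names "year" and "tuition" are assumptions about the clean file.
    ax.scatter(df["year"], df["tuition"], alpha=0.5)
    ax.set_xlabel("Academic year")
    ax.set_ylabel("Tuition")
    ax.set_title("Tuition vs. academic year")
    return ax
```

In a notebook cell, this could be called as `plot_tuition(pd.read_csv("data/clean-tuition.csv"))`, assuming the notebook sits next to the data folder.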
Data modeling
Create a modeling.py Python script and run Ordinary Least Squares on the clean-tuition.csv data.
Ensure that your main() function:
- prints only the summary of the model, followed by the rsquared and rsquared adjusted values (in this order; your script should not print anything else). You can use print(model.summary()), print(model.rsquared) and print(model.rsquared_adj) (replace model with the variable name you use in your script)
- writes a coefficients.csv file with all of your model’s coefficients. You can base your code on the coefficients code block at the top of this page (again, replace model with your variable name)
Here’s what your coefficients.csv file should look like (showing first rows only):
Submit to Gradescope
You are to submit three files to Gradescope:
data_wrangling.py
plots.ipynb
modeling.py
You can submit other .py files if you are using any. Do not include any subfolders in your submission, though: Gradescope will not consider any folders when running the code (only the data folder, where .csv files should be placed).
Gradescope will run your code on data that differs slightly from the data you are given, so your code should generalize (for example, do not hardcode model coefficients).
Click here to open the Gradescope assignment and submit your files