The data we will be working with stores variables as columns and observations as rows, a layout often called tidy data.
Feature engineering: data → features
Remember the spam vs not spam example? How do we go from words to features?
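One common way to go from words to features is a bag-of-words count: each distinct word becomes a column, and each message becomes a row of counts. The sketch below is only an illustration of that idea; the three messages are made-up examples, not data from the course, and the helper name `to_counts` is hypothetical.

```python
from collections import Counter

# Hypothetical toy messages (assumed for illustration only).
messages = [
    "win a free prize now",
    "free free money",
    "are we meeting for lunch",
]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for message in messages for word in message.split()})


def to_counts(message, vocabulary):
    """Return one row of word counts, one column per vocabulary word."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


# One row per message, one column per word.
features = [to_counts(message, vocabulary) for message in messages]

for message, row in zip(messages, features):
    print(row, "<-", message)
```

Each row now has the same length regardless of how long the message is, which is what lets a model treat the text as numeric input.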
What to look for:
Rows and Columns
Day | High | Low | Wind | Forecast |
---|---|---|---|---|
Tuesday | 24 | 15 | 0 to 15 mph | Sunny |
Wednesday | 38 | 17 | 5 to 15 mph | Mostly Sunny |
Thursday | 34 | 13 | 5 to 15 mph | Mostly Sunny |
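As a rough sketch of how rows like these could become numeric features, the code below stores each observation as a dictionary and derives a temperature range plus minimum and maximum wind speed. The derived feature names (`temp_range`, `wind_min`, `wind_max`) and the helper functions are assumptions chosen for illustration, not part of the original material.

```python
import re

# The table above, one dictionary per observation (row).
rows = [
    {"day": "Tuesday", "high": 24, "low": 15, "wind": "0 to 15 mph", "forecast": "Sunny"},
    {"day": "Wednesday", "high": 38, "low": 17, "wind": "5 to 15 mph", "forecast": "Mostly Sunny"},
    {"day": "Thursday", "high": 34, "low": 13, "wind": "5 to 15 mph", "forecast": "Mostly Sunny"},
]


def wind_range(text):
    """Parse a string like '5 to 15 mph' into (min_speed, max_speed)."""
    match = re.fullmatch(r"(\d+) to (\d+) mph", text)
    if match is None:
        raise ValueError(f"unexpected wind format: {text!r}")
    return int(match.group(1)), int(match.group(2))


def engineer(row):
    """Derive numeric features from one raw row (hypothetical feature set)."""
    wind_min, wind_max = wind_range(row["wind"])
    return {
        "temp_range": row["high"] - row["low"],
        "wind_min": wind_min,
        "wind_max": wind_max,
    }


engineered = [engineer(row) for row in rows]
print(engineered[0])
```

The raw `Wind` column is text, so a model cannot use it directly; splitting it into two numbers is one possible choice among many.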
Tuesday:
  ↳ Temperature:
      ↳ Low: 15
      ↳ High: 24
  ↳ Wind:
      ↳ Speed: 0 to 15 mph
      ↳ Direction: West
Wednesday:
  ↳ Temperature:
      ↳ Low: 17
      ↳ High: 38
  ↳ Wind:
      ↳ Speed: 5 to 15 mph
      ↳ Direction: North West
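Hierarchical data like this usually has to be flattened before it fits the rows-and-columns layout. The following is a minimal sketch, assuming the hierarchy is held in nested Python dictionaries; the `flatten` helper and the underscore-joined column names are illustrative choices, not a prescribed method.

```python
# The hierarchy above as nested dictionaries.
nested = {
    "Tuesday": {
        "Temperature": {"Low": 15, "High": 24},
        "Wind": {"Speed": "0 to 15 mph", "Direction": "West"},
    },
    "Wednesday": {
        "Temperature": {"Low": 17, "High": 38},
        "Wind": {"Speed": "5 to 15 mph", "Direction": "North West"},
    },
}


def flatten(record, prefix=""):
    """Collapse nested dictionaries into one flat dictionary of columns."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the parent key along as a column-name prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# One flat row per day: observations as rows, variables as columns.
rows = [{"Day": day, **flatten(details)} for day, details in nested.items()]

for row in rows:
    print(row)
```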
One winter, I became very quiet
and saw my life. It was February
and outside in the city streets,
snow fell but would not collect.
I bought snapdragons and thistle,
got some discount peach roses
that smelled off. I split them
between vases and moved
the bouquets from room to room
while a violin solo rang out.
For machine learning, we have a feature matrix that contains all of our predictor variables, and a target vector that contains the label or target for each observation.
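To make the feature matrix and target vector concrete, here is a small sketch built from the weather table earlier in these notes. Treating `forecast` as the target and `high`/`low` as the predictors is an assumption made purely for illustration, and the names `X` and `y` follow common convention rather than anything stated in the source.

```python
# Rows from the weather table (wind omitted to keep the example short).
rows = [
    {"day": "Tuesday", "high": 24, "low": 15, "forecast": "Sunny"},
    {"day": "Wednesday", "high": 38, "low": 17, "forecast": "Mostly Sunny"},
    {"day": "Thursday", "high": 34, "low": 13, "forecast": "Mostly Sunny"},
]

# Assumed choice of predictors and target for this sketch.
feature_names = ["high", "low"]
target_name = "forecast"

# Feature matrix: one row per observation, one column per predictor.
X = [[row[name] for name in feature_names] for row in rows]

# Target vector: one label per observation, in the same row order as X.
y = [row[target_name] for row in rows]

print(X)
print(y)
```

The key invariant is that `X` and `y` have the same number of rows and stay aligned: the i-th label belongs to the i-th row of features.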
We usually need enough data to split it into training and validation sets (and often a test set).
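A split like this can be sketched with the standard library alone. The function below shuffles row indices and carves off validation and test portions; the function name, the 20% fractions, and the fixed seed are all assumptions for illustration, and in practice a library helper would often be used instead.

```python
import random


def train_val_test_split(n_rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Return three disjoint lists of row indices: train, validation, test."""
    indices = list(range(n_rows))
    # Shuffle with a dedicated generator so the split is reproducible.
    random.Random(seed).shuffle(indices)

    n_test = int(n_rows * test_fraction)
    n_val = int(n_rows * val_fraction)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = train_val_test_split(10)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting indices rather than the data itself makes it easy to apply the same split to both the feature matrix and the target vector.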
The goal is to avoid overfitting our model to the training data, so that it still performs well on data it has not seen.
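An extreme case of overfitting is a model that simply memorizes its training rows. The toy sketch below uses made-up data in which the label is 1 when the first feature is greater than 5; the "memorizer" model and all names here are hypothetical and exist only to show the gap between training and validation accuracy.

```python
def fit_memorizer(X_train, y_train):
    """'Train' by storing every training row with its label."""
    return {tuple(x): label for x, label in zip(X_train, y_train)}


def predict(model, x, default):
    """Return the memorized label, or a default for rows never seen."""
    return model.get(tuple(x), default)


def accuracy(model, X, y, default):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(predict(model, x, default) == label for x, label in zip(X, y))
    return correct / len(y)


# Hypothetical toy data: the label is 1 when the first feature exceeds 5.
X_train = [[1, 0], [2, 1], [7, 0], [9, 1]]
y_train = [0, 0, 1, 1]
X_val = [[3, 0], [8, 1], [6, 0], [4, 1]]
y_val = [0, 1, 1, 0]

model = fit_memorizer(X_train, y_train)
train_acc = accuracy(model, X_train, y_train, default=0)
val_acc = accuracy(model, X_val, y_val, default=0)

print("train accuracy:", train_acc)
print("validation accuracy:", val_acc)
```

The memorizer is perfect on the rows it has stored but falls back to a guess on everything else, which is why performance is checked on held-out validation data rather than on the training set.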