Data

The data we will be working with has variables as columns and observations as rows (often called tidy data).

Feature engineering: data –> features

Remember the spam vs not spam example? How do we go from words to features?
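One common way to go from words to features is a bag-of-words count: each distinct word becomes a column and each message becomes a row of word counts. The sketch below is only an illustration under that assumption; the helper names `tokenize` and `bag_of_words` and the three example messages are made up for this slide, not taken from a real data set.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()


def bag_of_words(messages):
    """Return (vocabulary, rows); each row holds the word counts for one message."""
    tokenized = [tokenize(message) for message in messages]
    # The vocabulary is every distinct word seen, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not appear in this message.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical messages, invented for illustration only.
messages = ["Win a free prize now", "Lunch at noon?", "Free free prize"]
vocabulary, features = bag_of_words(messages)

print(vocabulary)
for message, row in zip(messages, features):
    print(row, message)
```

Each row of `features` lines up with `vocabulary`, so the words have become numeric columns a model could use.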

Data quality

What to look for:

  • a data dictionary
  • information on how the data were collected

Data format

  • tabular – tables, rows and columns
  • hierarchical – values are nested (like a tree)
  • unstructured data – no structure, for example: emails, videos, pictures

Tabular Data

Rows and Columns

Day        High  Low  Wind         Forecast
Tuesday    24    15   0 to 15 mph  Sunny
Wednesday  38    17   5 to 15 mph  Mostly Sunny
Thursday   34    13   5 to 15 mph  Mostly Sunny
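As a rough sketch, the same table can be loaded in Python with the standard-library `csv` module. The comma-separated layout of `raw` is an assumption made for this example; the values are the ones in the table above.

```python
import csv
import io

# The weather table written as CSV text (the CSV layout is assumed here).
raw = """Day,High,Low,Wind,Forecast
Tuesday,24,15,0 to 15 mph,Sunny
Wednesday,38,17,5 to 15 mph,Mostly Sunny
Thursday,34,13,5 to 15 mph,Mostly Sunny
"""

# Each row becomes a dict keyed by column name: one observation per row.
rows = list(csv.DictReader(io.StringIO(raw)))

# csv reads everything as text, so convert the numeric columns explicitly.
for row in rows:
    row["High"] = int(row["High"])
    row["Low"] = int(row["Low"])

columns = list(rows[0].keys())       # the variables
highs = [row["High"] for row in rows]  # one column across all observations

print(columns)
print(highs)
```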

Hierarchical Data

Tuesday:
   ↳ Temperature:
      ↳ Low: 15
      ↳ High: 24
   ↳ Wind:
      ↳ Speed: 0 to 15 mph 
      ↳ Direction: West
Wednesday:
   ↳ Temperature:
      ↳ Low: 17
      ↳ High: 38
   ↳ Wind:
      ↳ Speed: 5 to 15 mph
      ↳ Direction: North West
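In Python, a tree like this is naturally a nested dictionary (the same shape as JSON). The sketch below also shows one possible way to flatten it into tabular rows; the `flatten` helper and the dotted column names such as `Temperature.Low` are illustrative choices, not a standard.

```python
# The tree above, written as a nested dictionary.
weather = {
    "Tuesday": {
        "Temperature": {"Low": 15, "High": 24},
        "Wind": {"Speed": "0 to 15 mph", "Direction": "West"},
    },
    "Wednesday": {
        "Temperature": {"Low": 17, "High": 38},
        "Wind": {"Speed": "5 to 15 mph", "Direction": "North West"},
    },
}


def flatten(nested, prefix=""):
    """Turn a nested dict into a flat dict with dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the branch, carrying the path so far as a prefix.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# One flat row per day: hierarchical data reshaped into tabular data.
rows = [{"Day": day, **flatten(details)} for day, details in weather.items()]

for row in rows:
    print(row)
```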

Unstructured Data

One winter, I became very quiet
and saw my life. It was February

and outside in the city streets,
snow fell but would not collect.

I bought snapdragons and thistle,
got some discount peach roses

that smelled off. I split them
between vases and moved

the bouquets from room to room
while a violin solo rang out.

Matrices and Vectors

For machine learning, we have a feature matrix that contains all of our predictor variables, and a target vector that contains the label or target for each observation.
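A minimal sketch of that split, reusing the weather table from earlier. Treating `High` and `Low` as features and `Forecast` as the target is an arbitrary choice for illustration; the names `X` and `y` follow a common convention.

```python
# Observations taken from the weather table shown earlier.
observations = [
    {"Day": "Tuesday", "High": 24, "Low": 15, "Forecast": "Sunny"},
    {"Day": "Wednesday", "High": 38, "Low": 17, "Forecast": "Mostly Sunny"},
    {"Day": "Thursday", "High": 34, "Low": 13, "Forecast": "Mostly Sunny"},
]

# Assumed for illustration: predict the forecast from the two temperatures.
feature_names = ["High", "Low"]
target_name = "Forecast"

# Feature matrix: one row per observation, one column per feature.
X = [[obs[name] for name in feature_names] for obs in observations]

# Target vector: one label per observation, in the same row order as X.
y = [obs[target_name] for obs in observations]

print(X)
print(y)
```

Row `i` of `X` and entry `i` of `y` always describe the same observation.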

Training, Validation and Testing

We usually need enough data to split it into training, validation, and testing sets.

  • Most of our data will be used to build our model (training)
  • We never evaluate the model on data that was in our training data set
  • Our validation data set is used to fine-tune our model (we can also do cross-validation)
  • The test data set is used to assess the final model that was selected during the validation process

The goal is to avoid overfitting our model to the training data.
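A three-way split can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed, and the function name `train_validation_test_split` are assumptions for this example; in practice a library helper is often used instead.

```python
import random


def train_validation_test_split(rows, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle the rows and cut them into three non-overlapping sets."""
    shuffled = list(rows)                    # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility

    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]  # everything left goes to training
    return train, validation, test


# Stand-in data: 100 row indices rather than real observations.
rows = list(range(100))
train, validation, test = train_validation_test_split(rows)

print(len(train), len(validation), len(test))
```

Because each observation lands in exactly one set, the model is never scored on a row it was trained on.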

Practice

  • Access this Kaggle dataset on house prices
  • What are the variables in the data? (is there a data dictionary?)
  • Where does this data set come from? What’s its source?
  • Any problems you see with it?
  • Which variable could we use as target (response)?
  • Which variables would you use as features? Any feature engineering you can think of?