Data

Where to find data

Even before we start analyzing data, we have to acquire it.

There are a lot of data sets openly available on the internet, for example:

Data quality

Even before we start analyzing data, we need to make sure our data is tidy

What to look for:

  • a data dictionary
  • information on how the data were collected

Data format

  • tabular – tables, rows and columns
  • hierarchical – values are nested (like a tree)
  • unstructured data – no structure, for example: emails, videos, pictures

Tabular Data

Rows and Columns

Day       | High | Low | Wind        | Forecast
Tuesday   | 24   | 15  | 0 to 15 mph | Sunny
Wednesday | 38   | 17  | 5 to 15 mph | Mostly Sunny
Thursday  | 34   | 13  | 5 to 15 mph | Mostly Sunny

Hierarchical Data

Tuesday:
   ↳ Temperature:
      ↳ Low: 15
      ↳ High: 24
   ↳ Wind:
      ↳ Speed: 0 to 15 mph 
      ↳ Direction: West
Wednesday:
   ↳ Temperature:
      ↳ Low: 17
      ↳ High: 38
   ↳ Wind:
      ↳ Speed: 5 to 15 mph
      ↳ Direction: North West
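
As a rough sketch of how this tree could look in code, the forecast above can be written as nested Python dictionaries. The variable and function names (`forecast`, `flatten`) are made up for illustration. The `flatten` helper shows one possible way to turn the nested values into flat, table-like rows; it assumes every day has the same branches.

```python
# The Tuesday/Wednesday forecast from above, written as nested dictionaries.
forecast = {
    "Tuesday": {
        "Temperature": {"Low": 15, "High": 24},
        "Wind": {"Speed": "0 to 15 mph", "Direction": "West"},
    },
    "Wednesday": {
        "Temperature": {"Low": 17, "High": 38},
        "Wind": {"Speed": "5 to 15 mph", "Direction": "North West"},
    },
}

# Follow the branches of the tree to reach a single value.
tuesday_high = forecast["Tuesday"]["Temperature"]["High"]


def flatten(tree):
    """Turn the nested forecast into one flat row (a dict) per day."""
    rows = []
    for day, branches in tree.items():
        row = {"Day": day}
        for branch in branches.values():
            # Copy the leaf values (Low, High, Speed, Direction) into the row.
            row.update(branch)
        rows.append(row)
    return rows


rows = flatten(forecast)
print(tuesday_high)
print(rows[0])
```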

Unstructured Data

One winter, I became very quiet
and saw my life. It was February

and outside in the city streets,
snow fell but would not collect.

I bought snapdragons and thistle,
got some discount peach roses

that smelled off. I split them
between vases and moved

the bouquets from room to room
while a violin solo rang out.

full poem

Checking Understanding

Answer the Gradescope questions on Data Formats.

Remember that you can click on save for each answer to get feedback on whether you got the answer correct before clicking on submit.

Data format – file type

  • Data can be stored in different types of files
  • Some formats can be open only with specific software (for example, .xlsx files)
  • Often data is stored in a plain text file, with values separated by a specific character, for example
    • comma (.csv files, or comma-separated values)
    • tab (.tsv files, or tab-separated values)
  • Another common plain text format is JSON (JavaScript Object Notation, hierarchical name–value pairs) – APIs often provide this type of data
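
To make the role of the separator concrete, here is a minimal sketch using Python's standard `csv` module. The two text snippets are small stand-ins for file contents, built from the forecast table earlier in these slides, and the helper name `read_delimited` is an assumption for illustration.

```python
import csv
import io

# The same two rows, once comma-separated and once tab-separated.
csv_text = "Day,High,Low\nTuesday,24,15\nWednesday,38,17\n"
tsv_text = "Day\tHigh\tLow\nTuesday\t24\t15\nWednesday\t38\t17\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

# Both snippets hold the same table; only the separator differs.
print(csv_rows == tsv_rows)

# Values arrive as strings, so numeric columns need converting.
highs = [int(row["High"]) for row in csv_rows]
print(highs)
```

With a real file, `io.StringIO(text)` would be replaced by an opened file object.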

Comma/Tab Separated Values

Inspect this data set:

  • What format is it?
  • What are the variables (and types)?

What about this data set?

JSON

Inspect this weather data

  • What are (some of) the variables (and types)?

These data were retrieved using the weather.gov API.
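
One way to explore the variables and types in JSON data is to parse it and walk the tree. The sketch below uses Python's standard `json` module. The sample text is a hypothetical stand-in based on the hierarchical forecast shown earlier, not an actual weather.gov response, and the helper name `describe` is made up for illustration.

```python
import json

# Hypothetical sample, NOT a real weather.gov response.
json_text = """
{
  "Tuesday": {
    "Temperature": {"Low": 15, "High": 24},
    "Wind": {"Speed": "0 to 15 mph", "Direction": "West"}
  }
}
"""

data = json.loads(json_text)


def describe(node, prefix=""):
    """Return (path, type name) pairs for every leaf value in the tree."""
    found = []
    for name, value in node.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # A nested object: keep walking down this branch.
            found.extend(describe(value, path + "."))
        else:
            found.append((path, type(value).__name__))
    return found


for path, type_name in describe(data):
    print(path, type_name)
```

This sketch only descends into nested objects; JSON arrays would be reported as a single `list` value.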

Data format

Get data on mortality by country

  • What format is the data?
  • What are the variables?

The data you will be working with is often too large for Excel.

What’s tidy data?

Even before we start analyzing data, we need to make sure our data is tidy

  • Each column is a variable
  • Each row is an observation

We can wrangle our raw data to make it tidy
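
As an illustration of what such wrangling might look like, the sketch below starts from a hypothetical "wide" layout of the forecast temperatures, where the days appear as column names, and reshapes it so that each row is one day (an observation) and each column is one variable. The layout and the function name `tidy` are assumptions for this example; only the temperature values come from the forecast table earlier in these slides.

```python
# A hypothetical "wide" layout of the forecast: one column per day,
# so the days (which are values) sit in the header.
wide = [
    {"Measure": "High", "Tuesday": 24, "Wednesday": 38, "Thursday": 34},
    {"Measure": "Low", "Tuesday": 15, "Wednesday": 17, "Thursday": 13},
]


def tidy(rows):
    """Reshape so each row is one day and each column is one variable."""
    by_day = {}
    for row in rows:
        measure = row["Measure"]
        for day, value in row.items():
            if day == "Measure":
                continue
            # Create the row for this day on first sight, then add the measure.
            by_day.setdefault(day, {"Day": day})[measure] = value
    return list(by_day.values())


tidy_rows = tidy(wide)
for observation in tidy_rows:
    print(observation)
```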

Example of tidy data

Each column is a variable, each row is an observation

Cheese data

Practice

Take a look at the Apple Sales Dataset on Kaggle and answer the following questions:

  • What are the variables in the data?
  • What are the variable types? (Categorical – ordinal or not, numeric – discrete or continuous)
  • Is this data set tidy? Why (not)?