1 Viz critique
2 Variable mapping to viz encodings
- 2.1 What variable types are present in the following data:
  - 2.1.1 What variables would you map to build the following visualizations based on these data:
  - 2.1.2 What color scheme would you use for each of the visualizations above?
- 2.2 What variable types are present in the following data (NYTimes best sellers):
  - 2.2.1 What visualization would you build to answer the following questions? Include which variables you would map to each encoding, and what color scheme you would use:
3 Vega spec completion
4 Audiences, purposes, and storytelling

1 Viz critique

Apply Munzner’s What-Why-How framework to evaluate the quality of each visualization, mapping it to the data-task-idiom trio:

What? (data, data quality)
Why? (user intent)
How? (visual encoding – scales, perceptual issues, interaction)

Remember also the general guidelines:

Maximize data-to-ink ratio
Avoid multiple font faces, and extraneous colors and backgrounds
Simplify, mute, or delete gridlines
Remove superfluous axis marks and labels
Avoid excessive decorative embellishments
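
In Vega, several of these guidelines translate directly into axis properties. The fragment below is only a sketch: the scale name `yScale` and the specific values are assumptions chosen for illustration, not recommended settings.

```javascript
// Hypothetical y axis that keeps gridlines but mutes them,
// and drops the axis line and tick marks.
const quietAxis = {
  scale: "yScale",   // assumed scale name
  orient: "left",
  grid: true,        // keep gridlines ...
  gridOpacity: 0.2,  // ... but make them faint
  gridDash: [2, 2],  // dashed gridlines draw less attention than solid ones
  domain: false,     // remove the axis line
  ticks: false,      // remove tick marks
  tickCount: 5,      // ask for fewer labels
  labelPadding: 6    // space between labels and the plot area
};
```

An object like this would go in the `axes` array of a spec in place of a default axis definition.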

1.1 Visualization A

Visualization by Stephan Teodosescu (@steodosescu)

Overall, the visualizations do a good job of showing changes in the number of flights over time for countries in Europe. However, it’s not clear what the colors mean (encoding). Also, counting departing AND arriving flights together obscures information; showing departing and arriving flights separately would be more meaningful (why). The changes in the lines get more confusing the lower a country is ranked, and the colors don’t help here either (encoding).

1.2 Visualization B

Visualization by Pauline Baudry @PauBaudry

All bars sum to 100% instead of separating won vs. disqualified, which makes comparisons difficult (why), especially since these are not all the ingredients used, but only the top 10 (data/encoding). The light color is very similar to the background, and the white font makes it very hard to read (encoding).

1.3 Visualization C

Visualization by Nicola Rennie @nrennie35

Number of matches played might not be the ideal measure for understanding which countries perform better (why), so it’s not clear what the point is here. The plot is hard to read because too many lines overlap (encoding). Maybe a comparison of the number of matches played by winners would be more informative (why).

1.4 Visualization D

Visualization by Dan Oehm @danoehm

Not everyone will recognize the flags of the countries; making the plot interactive, with the country name showing up, would add to its accessibility. There is a nice interpretation of results in the text at the top (why), but I am not sure the interpretation has any actual meaning. The home vs. away distinction is not meaningful here (data); it is just the way the data is organized. The more I try to understand this plot, the harder it gets.

2 Variable mapping to viz encodings

Variables can be:

Quantitative/numeric:
- Discrete: counts like number of bigfoot sightings
- Continuous: temperature, height
Categorical:
- Ordered: size
- Unordered: countries

Color schemes that best represent each type of variable:

Sequential gradients with a single hue should be used for numeric continuous or ordered variables that run from low to high (e.g., population size). Light colors should represent low values, and dark colors high values.

Diverging scales should be used where there is a neutral midpoint and values vary in either direction from that neutral point (e.g., temperatures can be positive and negative).

Qualitative palettes where each color has the same valence (i.e., no color dominates the other) should be used for unordered categorical variables (e.g., political parties, countries). Different colors in the palette should not imply differences in magnitude.
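
The three rules above can be sketched as a small helper that returns a Vega color scale definition for a given kind of variable. The function name `colorScaleFor`, the scale name `color`, and the three type labels are hypothetical. The scheme names (`blues`, `redblue`, `category10`) are built-in Vega schemes, picked here only as examples, and the diverging case assumes the neutral midpoint is zero.

```javascript
// Hypothetical helper: choose a Vega color scale for a kind of variable.
// `field` and `data` name the column and data set the scale reads from.
function colorScaleFor(variableType, field, data) {
  const domain = { data: data, field: field };
  switch (variableType) {
    case "sequential":
      // single-hue gradient, light for low values and dark for high values
      return { name: "color", type: "linear", domain: domain, range: { scheme: "blues" } };
    case "diverging":
      // two hues meeting at a neutral midpoint (assumed to be 0 here)
      return { name: "color", type: "linear", domain: domain, domainMid: 0, range: { scheme: "redblue" } };
    case "qualitative":
      // distinct hues with no implied order or magnitude
      return { name: "color", type: "ordinal", domain: domain, range: { scheme: "category10" } };
    default:
      throw new Error("Unknown variable type: " + variableType);
  }
}
```

For example, `colorScaleFor("diverging", "democrat_difference", "votes")` would produce a scale entry for a data set assumed to be named `votes`.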

2.1 What variable types are present in the following data:

year	state	id	total	DEMOCRAT	REPUBLICAN	democrat_difference	republican_difference
2020	ALABAMA	1	2323282	0.3656999	0.6203164	-0.2546165	0.2546165
2020	ALASKA	2	359530	0.4277195	0.5283314	-0.1006119	0.1006119
2020	ARIZONA	4	3387326	0.4936469	0.4905598	0.0030871	-0.0030871
2020	ARKANSAS	5	1219069	0.3477506	0.6239573	-0.2762067	0.2762067
2020	CALIFORNIA	6	17500881	0.6348395	0.3432072	0.2916322	-0.2916322
2020	COLORADO	8	3279980	0.5501107	0.4160413	0.1340694	-0.1340694
2020	CONNECTICUT	9	1823857	0.5926073	0.3918712	0.2007361	-0.2007361
2020	DELAWARE	10	504346	0.5874301	0.3977488	0.1896813	-0.1896813
2020	DISTRICT OF COLUMBIA	11	344356	0.9214969	0.0539732	0.8675237	-0.8675237
2020	FLORIDA	12	11067456	0.4786145	0.5121982	-0.0335837	0.0335837

- Categorical unordered: state, id
- Categorical/discrete ordered: year
- Numeric discrete: total
- Numeric continuous: DEMOCRAT, REPUBLICAN, democrat_difference (divergent), republican_difference (divergent)

2.1.1 What variables would you map to build the following visualizations based on these data:

A line plot showing the percentage of votes for the democratic candidate across the years for the state of Arizona: x mapped to year, y mapped to DEMOCRAT; optionally, split by state with an interactive element, so state is mapped to color/interaction.
A map plot with percent democrat/republican votes per state: democrat_difference mapped to fill for each shape.
A bar plot showing the 5 states that voted the most republican and the 5 states that voted the most democrat: fill mapped to democrat/republican, state mapped to y (so that we can read labels horizontally), total votes mapped to x; optional interactive element mapped to year.
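
For the bar plot, the rows first have to be reduced to the most republican and most democrat states. A minimal sketch in plain JavaScript, using the ten rows from the table above (state and democrat_difference only); the function name `extremeStates` is hypothetical, and `n` is set to 3 rather than 5 only because the sample has ten rows.

```javascript
// Rows copied from the table above (only the two columns needed here).
const votes = [
  { state: "ALABAMA", democrat_difference: -0.2546165 },
  { state: "ALASKA", democrat_difference: -0.1006119 },
  { state: "ARIZONA", democrat_difference: 0.0030871 },
  { state: "ARKANSAS", democrat_difference: -0.2762067 },
  { state: "CALIFORNIA", democrat_difference: 0.2916322 },
  { state: "COLORADO", democrat_difference: 0.1340694 },
  { state: "CONNECTICUT", democrat_difference: 0.2007361 },
  { state: "DELAWARE", democrat_difference: 0.1896813 },
  { state: "DISTRICT OF COLUMBIA", democrat_difference: 0.8675237 },
  { state: "FLORIDA", democrat_difference: -0.0335837 }
];

// Hypothetical helper: the n most republican and n most democrat states.
function extremeStates(rows, n) {
  // Sort a copy from most republican (most negative) to most democrat.
  const sorted = [...rows].sort((a, b) => a.democrat_difference - b.democrat_difference);
  return {
    republican: sorted.slice(0, n).map(r => r.state),
    democrat: sorted.slice(-n).reverse().map(r => r.state)
  };
}

const top = extremeStates(votes, 3);
console.log(top.republican); // [ 'ARKANSAS', 'ALABAMA', 'ALASKA' ]
console.log(top.democrat);   // [ 'DISTRICT OF COLUMBIA', 'CALIFORNIA', 'CONNECTICUT' ]
```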

2.1.2 What color scheme would you use for each of the visualizations above?

1. Qualitative color scheme for state.
2. Divergent color scale, with red representing more republican and blue more democratic; make zero white.
3. Red bars for republican votes, blue for democratic votes. No need to map color to state since state is mapped to one of the axes.

2.2 What variable types are present in the following data (NYTimes best sellers):

id	title	author	year	total_weeks	first_week	debut_rank	best_rank
0	“H” IS FOR HOMICIDE	Sue Grafton	1991	15	1991-05-05	1	2
1	“I” IS FOR INNOCENT	Sue Grafton	1992	11	1992-04-26	14	2
10	‘’G’’ IS FOR GUMSHOE	Sue Grafton	1990	6	1990-05-06	4	8
100	A DOG’S JOURNEY	W. Bruce Cameron	2012	1	2012-05-27	3	14
1000	CHANGING FACES	Kimberla Lawson Roby	2006	1	2006-02-19	11	14
1001	CHAOS	Patricia Cornwell	2016	3	2016-12-04	1	7
1002	CHAPTERHOUSE: DUNE	Frank Herbert	1985	16	1985-04-21	9	2
1003	CHARADE	Sandra Brown	1994	5	1994-05-01	7	10
1004	CHARLESTON	John Jakes	2002	4	2002-08-25	7	12
1005	CHARLOTTE GRAY	Sebastian Faulks	1999	1	1999-03-14	12	17

- Categorical unordered: author, title, id
- Categorical/discrete ordered: year, first_week
- Numeric discrete: total_weeks, debut_rank, best_rank

2.2.1 What visualization would you build to answer the following questions? Include which variables you would map to each encoding, and what color scheme you would use:

What are the top 10 books that stayed the most weeks in the NYTimes best sellers list? Bar plot with title mapped to the y axis and total_weeks mapped to x.
How has the debut ranking for books by Stephen King changed over time? Line plot with year mapped to x and debut_rank mapped to y.
How does debut rank for Stephen King compare with debut rank for Danielle Steel over time? Line plot with year mapped to x, debut_rank mapped to y, and a qualitative color mapped to author.
Which books had the largest difference between best rank and debut rank? Line plot with the rank type (debut rank vs. best rank, as categories) mapped to x, and the actual rank number for debut_rank and best_rank mapped to y, in two colors: one for positive change, another for negative change. Annotate the plot with the title of the book (or make it show up when hovering over the line).
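
The last question needs a derived value before anything can be plotted. A minimal sketch in plain JavaScript, using four rows from the table above; the function name `rankChanges` and the sign convention (debut_rank minus best_rank) are assumptions made for illustration.

```javascript
// Four rows copied from the table above (only the columns needed here).
const books = [
  { title: "CHAOS", debut_rank: 1, best_rank: 7 },
  { title: "CHAPTERHOUSE: DUNE", debut_rank: 9, best_rank: 2 },
  { title: "CHARADE", debut_rank: 7, best_rank: 10 },
  { title: "CHARLESTON", debut_rank: 7, best_rank: 12 }
];

// Hypothetical helper: rank change per book, largest absolute change first.
function rankChanges(rows) {
  return rows
    .map(r => {
      const change = r.debut_rank - r.best_rank; // assumed sign convention
      // The direction picks one of the two colors; zero is grouped with positive.
      return { title: r.title, change: change, direction: change < 0 ? "negative" : "positive" };
    })
    .sort((a, b) => Math.abs(b.change) - Math.abs(a.change));
}

const changes = rankChanges(books);
console.log(changes[0].title, changes[0].change); // CHAPTERHOUSE: DUNE 7
```

The `direction` field is what would be mapped to the two-color scale, and `title` to the annotation or hover text.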

3 Vega spec completion

Complete the Vega specification for the three plots (they all use the same data – NY Times Best Sellers).

var spec = {
  $schema: "https://vega.github.io/schema/vega/v5.json",
  description: "NY Times Best Sellers of All Times",
  width: 800,
  height: 400,
  padding: 50,
  data: [
    {
      name: "books",
      url: "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-10/nyt_titles.tsv",
      format: { type: "tsv" }
    },
    { 
      name: "aggregate",
      source: "books",
      transform: [
        {
          type: "aggregate",
          groupby: ["year"],
          fields: ["total_weeks"],
          ops: ["mean"],
          as: ["total_weeks"]
        }
      ]
    }
  ],
  scales: [
    {
      name: "xScale",
      type: "linear",
      domain: { field: "year", data: "aggregate" },
      range: "width",
      zero: false
    },
    {
      name: "yScale",
      type: "linear",
      domain: { field: "total_weeks", data: "aggregate" },
      range: "height",
      zero: false
    }
  ],
  axes: [
    {
      scale: "xScale",
      orient: "bottom",
      format: "d",
      title: "Year"
     
    },
    {
      scale: "yScale",
      orient: "left",
      title: "average number of weeks in the NY Times Best Sellers list"
    }
  ],
  marks: [
    {
      type: "symbol",
      from: { data: "aggregate" },
      encode: {
        enter: {
          x: { field: "year", scale: "xScale" },
          y: { field: "total_weeks", scale: "yScale" },
        }
      }
    }
  ],
  title: {
    text: "NY Times Best Sellers Books by Average Total Weeks"
  }
};

var spec = {
  $schema: "https://vega.github.io/schema/vega/v5.json",
  description: "NY Times Best Sellers of All Times",
  width: 800,
  height: 800,
  padding: 50,
  data: [
    {
      name: "books",
      url: "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-10/nyt_titles.tsv",
      format: { type: "tsv" },
      transform: [
        {
          // Dividing by 1 coerces total_weeks to a number before sorting and filtering
          type: "formula",
          expr: "datum.total_weeks / 1",
          as: "total_weeks"
        },
        {
          type: "collect",
          sort: { field: "total_weeks", order: "descending"}
        },
        {
          type: "filter",
          expr: "datum.total_weeks > 94"
        }
      ]
    }
  ],
  scales: [
    {
      name: "yScale",
      type: "linear",
      domain: [2020, 1931],
      range: "height",
      zero: false
    },
    {
      name: "xScale",
      type: "linear",
      domain: { field: "total_weeks", data: "books" },
      range: "width"
    }
  ],
  axes: [
    {
      scale: "xScale",
      orient: "bottom",
      title: "total weeks in the NY Times Best Sellers list"
    },
    {
      scale: "yScale",
      orient: "left",
      format: "d",
      title: "Year"
      
    }
  ],
  marks: [
    {
      type: "rect",
      from: { data: "books" },
      encode: {
        enter: {
          y: { field: "year", scale: "yScale" },
          x: { field: "total_weeks", scale: "xScale" },
          x2: { value: 0, scale: "xScale" },
          height: { value: 3 }
        }
      }
    },
    {
      type: "text",
      from: {data : "books" },
      encode: {
        enter: {
          text: { signal: "datum.title + ' by ' + datum.author + ' ' + datum.year" },
          y: { field: "year", scale: "yScale" },
          x: { field: "total_weeks", scale: "xScale" },
          align: { value: "right"}
        }
      }
    }
  ],
  title: {
    text: "NY Times Best Sellers Books with longest total weeks in list",
    subtitle: "The 1980s didn't have any books in the list for more than 94 weeks"
  }
};

var spec = {
  $schema: "https://vega.github.io/schema/vega/v5.json",
  description: "NY Times Best Sellers of All Times",
  width: 800,
  height: 400,
  padding: 50,
  data: [
    {
      name: "books",
      url: "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-05-10/nyt_titles.tsv",
      format: { type: "tsv" }
    },
    { 
      name: "aggregate",
      source: "books",
      transform: [
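        {
          // With no ops given, aggregate counts the rows in each group
          // and writes the result to a field named "count"
          type: "aggregate",
          groupby: ["year"]
        }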
      ]
    }
  ],
  scales: [
    {
      name: "xScale",
      type: "linear",
      domain: { field: "year", data: "aggregate" },
      range: "width",
      zero: false
    },
    {
      name: "yScale",
      type: "linear",
      domain: { field: "count", data: "aggregate" },
      range: "height",
      zero: true
    }
  ],
  axes: [
    {
      scale: "xScale",
      orient: "bottom",
      title: "Year",
      format: "d"
     
    },
    {
      scale: "yScale",
      orient: "left",
      title: "Number of Books in the NY Times Best Sellers list"
    }
  ],
  marks: [
    {
      type: "rect",
      from: { data: "aggregate" },
      encode: {
        enter: {
          x: { field: "year", scale: "xScale" },
          y: { field: "count", scale: "yScale" },
          y2: { value: 0, scale: "yScale" },
          width: { value: 5 }
        }
      }
    }
  ],
  title: {
    text: "Total NY Times Best Sellers Books by year"
  }
};
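
The first and third specs both rely on the aggregate transform grouped by year: the first takes the mean of total_weeks, the third takes a count. As a rough plain-JavaScript equivalent of what that transform computes (a sketch, not how Vega implements it), with a hypothetical function name `aggregateByYear` and toy rows that are not taken from the data set:

```javascript
// Toy rows for illustration only; they are NOT taken from the data set.
const rows = [
  { year: 2000, total_weeks: 4 },
  { year: 2000, total_weeks: 8 },
  { year: 2001, total_weeks: 3 }
];

// Hypothetical helper: one output row per year with a count and a mean.
function aggregateByYear(input) {
  const groups = new Map();
  for (const r of input) {
    const g = groups.get(r.year) || { year: r.year, count: 0, sum: 0 };
    g.count += 1;
    g.sum += r.total_weeks;
    groups.set(r.year, g);
  }
  return [...groups.values()].map(g => ({
    year: g.year,
    count: g.count,          // what the third spec plots
    mean: g.sum / g.count    // what the first spec plots (there named total_weeks)
  }));
}

console.log(aggregateByYear(rows));
// [ { year: 2000, count: 2, mean: 6 }, { year: 2001, count: 1, mean: 3 } ]
```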

4 Audiences, purposes, and storytelling

Consider the three plots from the previous questions when answering the following questions:

4.1 What would be the 3-minute story that the three plots tell?

The New York Times Best Sellers are up-to-date and authoritative lists of the most popular books in the United States, based on sales in the past week. They can be used as a measure of what Americans are reading across the years. The average number of weeks books stayed in the list increased from the 1930s to the 1970s, and has gradually decreased since then. Did books get less popular, or were more books simply being published, so that public attention is divided among the many reading options available? Looking at the data in a more fine-grained manner shows us that no books stayed in the list for more than 94 weeks between 1960 and 1990. At the same time, when we investigate the total number of books in the list per year, there was a decrease in publications after 1960, with the number of books increasing steadily since the 1980s. So it does seem that the 1970s and 1980s were not great decades for popular books, but we do have an abundance of books since the 2000s, with a number of books staying in the list for well over 100 weeks in total.

4.2 What is the big idea?

The 1970s and 1980s were not great for the book industry (not a lot of popular books in the NY Times Best Sellers list). There has been an abundance of books in the list since the 2000s, with a number of books staying in the list well over 100 weeks total. It is not clear if this is an indication that people are reading more, but it is an indication that more books are being published.

4.3 What would the audience be for this story?

Bookstores that want to make informed decisions on what to stock their shelves with.

CSC 444 Data Visualization

Final Exam Review

December 6, 2022