Data Visualization – principles

Why is data visualization important?

Anscombe’s quartet highlights the importance of data visualization and warns against relying solely on summary statistics for data analysis:

library(tidyverse)
# Read the quartet in long format (columns used below: group, x, y)
anscombe <- read_csv("data/anscombe.csv")

  1. What are the mean and sd of x and y for each group?
  2. Create a scatter plot of x and y for each group.

Mean and sd per group

anscombe |>
  group_by(group) |>
  summarize(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y))

group  mean_x  sd_x      mean_y    sd_y
1      9       3.316625  7.500909  2.031568
2      9       3.316625  7.500909  2.031657
3      9       3.316625  7.500000  2.030424
4      9       3.316625  7.500909  2.030578

Plots per group

anscombe |>
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 3) +
  facet_wrap(~group)
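
The means and standard deviations are not the only numbers that agree. As a minimal sketch, assuming R's built-in datasets::anscombe (which stores the quartet in wide format as x1 to x4 and y1 to y4, unlike the long-format CSV used above), the correlation and the fitted regression line can be compared across the four groups. The names quartet and fit_stats are made up for this illustration.

```r
# Built-in copy of Anscombe's quartet (wide format: x1..x4, y1..y4).
quartet <- datasets::anscombe

# Correlation and least-squares fit for each of the four groups.
fit_stats <- t(sapply(1:4, function(i) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  fit <- coef(lm(y ~ x))
  c(group = i,
    cor = cor(x, y),
    intercept = unname(fit[1]),
    slope = unname(fit[2]))
}))

round(fit_stats, 2)
```

Every group has a correlation of about 0.82 and a fitted line close to y = 3 + 0.5x, even though the scatter plots look nothing alike.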

Datasaurus

Download the Datasaurus data set and set up your working environment.

library(tidyverse)
datasaurus <- read_csv("data/datasaurus_dozen.csv")

  1. What is the mean and standard deviation of x and y by dataset group?
  2. Create a scatterplot for all dataset groups.

Descriptive statistics

datasaurus |>
  group_by(dataset) |>
  summarize(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y))

dataset     mean_x    sd_x      mean_y    sd_y
away        54.26610  16.76983  47.83472  26.93974
bullseye    54.26873  16.76924  47.83082  26.93573
circle      54.26732  16.76001  47.83772  26.93004
dino        54.26327  16.76514  47.83225  26.93540
dots        54.26030  16.76774  47.83983  26.93019
h_lines     54.26144  16.76590  47.83025  26.93988
high_lines  54.26881  16.76670  47.83545  26.94000
slant_down  54.26785  16.76676  47.83590  26.93610
slant_up    54.26588  16.76885  47.83150  26.93861
star        54.26734  16.76896  47.83955  26.93027
v_lines     54.26993  16.76996  47.83699  26.93768
wide_lines  54.26692  16.77000  47.83160  26.93790
x_shape     54.26015  16.76996  47.83972  26.93000

Scatterplots

datasaurus |> 
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~dataset)

Visual Perception

Edges, contrasts, and colors

  • Our visual system (i.e., all the visual processing models from our eyes to our brain) re-constructs what we are looking at
  • Our visual system uses relative differences instead of absolute values for brightness and colors
  • This relativization of brightness is demonstrated in the image in the next slide, in which we perceive darker blobs where grid lines meet at the edge of our visual field

Hermann grid effect

Hermann grid effect

  • At the very center of the visual field, we see the lines as they are (just white).
  • The center of the visual field is where our vision is sharpest.
  • Our ability to distinguish details decreases toward the edges of our visual field (Ware, 2008).
  • Visual illusions (like the image on the previous slide) are side effects of how our visual system processes what we see.

Contrast

  • Another side effect of our visual system processing is that we perceive the same shade as lighter or darker depending on the background behind it
  • In the image on the next slide, the smaller squares are all exactly the same shade of gray
  • We perceive the gray to be lighter on darker backgrounds and darker on lighter backgrounds
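
The effect described above can be reproduced with a few lines of ggplot2. This is a minimal sketch, not the code behind the slide image: the data frame backgrounds and the chosen gray levels are assumptions made for illustration. The inner squares are all drawn with the same fill, "gray50", yet they tend to look different against each background.

```r
library(ggplot2)

# Four background panels, from dark gray to light gray (made-up levels).
backgrounds <- data.frame(
  panel = 1:4,
  bg = gray(c(0.1, 0.3, 0.7, 0.9))
)

contrast_plot <- ggplot(backgrounds, aes(x = panel, y = 1)) +
  # Large tiles: backgrounds of different brightness.
  geom_tile(aes(fill = bg), width = 1, height = 1) +
  # Small tiles: the same mid-gray on every background.
  geom_tile(fill = "gray50", width = 0.4, height = 0.4) +
  # Use the hex codes in `bg` directly as fill colors.
  scale_fill_identity() +
  coord_fixed() +
  theme_void()

contrast_plot
```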

Contrast

Edges, colors, and contrasts

  • Our visual system is not a physical light meter: what we perceive differs from the actual values of color and brightness
  • Humans are better at seeing edge contrasts in monochrome images than in color images
  • We are also better at distinguishing dark shades (compared to light shades)

Color

Let’s start with some definitions regarding color. “Color” has three components:

  • Hue refers to the dominant wavelength of the color (e.g., red, blue, purple)
  • Chroma or saturation refers to how intense or vivid the color is
  • Luminance refers to how bright a color is
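
To make the three components concrete, the sketch below uses hcl() from base R's grDevices package, which builds a color from a hue angle (h), a chroma value (c), and a luminance value (l). Each set changes one component while holding the other two fixed. The specific numbers are arbitrary choices for illustration.

```r
# hcl() is in grDevices, which is attached in every R session.
# Arguments: h = hue (angle, 0-360), c = chroma, l = luminance (0-100).

# Vary hue only: different colors with equal chroma and luminance.
vary_hue <- hcl(h = c(0, 120, 240), c = 50, l = 60)

# Vary chroma only: one hue, from nearly gray to vivid.
vary_chroma <- hcl(h = 240, c = c(10, 40, 70), l = 60)

# Vary luminance only: one hue, from dark to light.
vary_luminance <- hcl(h = 240, c = 40, l = c(30, 60, 90))

# Show the three sets as rows of swatches (top row: hue, middle: chroma,
# bottom: luminance).
swatches <- c(vary_hue, vary_chroma, vary_luminance)
plot(rep(1:3, times = 3), rep(3:1, each = 3),
     col = swatches, pch = 15, cex = 8,
     xlim = c(0.5, 3.5), ylim = c(0.5, 3.5),
     axes = FALSE, xlab = "", ylab = "")
```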

HSL

Color

  • Color schemes need to accurately represent differences in our data.
  • For example, changes from one level to the next need to be perceived as having the same magnitude
  • We should be careful when picking colors for our variables in our data visualizations

Order these colors

Order these colors

Color

  • Gradients should be used as sequential scales from low to high to represent numeric continuous/ordered variables (e.g., population size). Light colors should represent low values, and dark colors high values.
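
One way to get such a sequential scale in base R is hcl.colors() (grDevices, R 3.6.0 or later). This is a minimal sketch: the palette "Blues 3" and the number of levels are arbitrary choices. The brightness check uses standard luma weights as a rough proxy for how light each color looks.

```r
# "Blues 3" runs from dark to light by default, so rev = TRUE puts the
# light color first (low values) and the dark color last (high values).
sequential <- hcl.colors(5, palette = "Blues 3", rev = TRUE)

# Rough brightness of each color: weighted sum of its red, green, and
# blue values (Rec. 709 luma weights).
brightness <- colSums(col2rgb(sequential) * c(0.2126, 0.7152, 0.0722))
round(brightness)
```

The brightness values should fall steadily from the first color to the last, which is what lets viewers read "darker" as "more".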

Color

  • Diverging scales should be used when there is a neutral midpoint and the values vary in either direction from that midpoint (e.g., temperatures can be positive and negative).
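
In ggplot2, a diverging scale can be sketched with scale_fill_gradient2(), which takes a low, mid, and high color plus a midpoint. The anomalies data frame below is made up purely for illustration, and the three colors are arbitrary choices.

```r
library(ggplot2)

# Hypothetical monthly temperature anomalies (made-up values).
anomalies <- data.frame(
  month = factor(month.abb, levels = month.abb),
  anomaly = c(-3, -2.5, -1, 0, 1.5, 2, 3, 2.5, 1, 0, -1.5, -2)
)

diverging_plot <- ggplot(anomalies, aes(x = month, y = 1, fill = anomaly)) +
  geom_tile() +
  # Neutral color at the midpoint, a different hue on each side of it.
  scale_fill_gradient2(low = "steelblue", mid = "white", high = "firebrick",
                       midpoint = 0) +
  theme_minimal()

diverging_plot
```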

Color

  • Qualitative palettes where each color has the same valence (i.e., no color dominates the other) should be used for unordered categorical variables (e.g., political parties, countries). Different colors in the palette should not imply differences in magnitude.
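
A qualitative palette of this kind can be sketched by spacing hues evenly while keeping chroma and luminance fixed, so that the colors differ only in hue. The category names and the chroma and luminance values below are assumptions chosen for illustration.

```r
# Four hypothetical categories (names are made up).
parties <- c("A", "B", "C", "D")

# Evenly spaced hue angles; drop the last one because 360 equals 0.
hues <- seq(0, 360, length.out = length(parties) + 1)[seq_along(parties)]

# Same chroma and luminance for every category, so no color looks
# brighter or more intense than another.
qualitative <- setNames(hcl(h = hues, c = 60, l = 65), parties)
qualitative
```

A named vector like this can be passed to ggplot2 through scale_color_manual(values = qualitative), provided the names match the levels of the mapped variable.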

Color

  • We also need to take into account people who are color-blind, and how they perceive colors
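
One option, offered here as a suggestion rather than the only approach, is the Okabe-Ito palette, which base R documents as suitable for viewers with color vision deficiencies. It is available through palette.colors() (grDevices, R 4.0.0 or later).

```r
# First eight colors of the Okabe-Ito palette as hex codes.
okabe_ito <- palette.colors(n = 8, palette = "Okabe-Ito")
okabe_ito
```

To use it in ggplot2, pass the colors to scale_color_manual(values = unname(okabe_ito)). The unname() call matters because the palette comes with color names, which ggplot2 would otherwise try to match against the levels of the data. ggplot2's built-in viridis scales, such as scale_color_viridis_d(), are another option.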

Shape

  • We can also use shape as a channel to encode information
  • The color channel, however, is more salient, and shape can be used as a secondary channel.

Visual Search and Pop-out Effect

Try to find the type B data point in the three plots in the next slides

Visual Search and Pop-out Effect

Try to find the type B data point

Visual Search and Pop-out Effect

Try to find the type B data point

Visual Search and Pop-out Effect

Try to find the type B data point

Color and shape

  • The type B data point should have been easier to find in the first and last plots, and harder in the middle plot.
  • Color is a channel that is easier to process than shape
  • Combining both shape and color is at times a good strategy.
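
The strategy of combining both channels can be sketched in ggplot2 by mapping the same variable to color and shape. The data below are random points generated for illustration (they are not the data behind the slides), with a single type B point to search for.

```r
library(ggplot2)

set.seed(1)  # reproducible made-up points
search_data <- data.frame(
  x = runif(30),
  y = runif(30),
  type = c(rep("A", 29), "B")  # one type B point among 29 type A points
)

popout_plot <- ggplot(search_data, aes(x = x, y = y)) +
  # Map type to both color and shape so the two channels reinforce each other.
  geom_point(aes(color = type, shape = type), size = 3) +
  theme_minimal()

popout_plot
```

Removing color = type from aes() leaves shape as the only channel, which makes it easy to compare how quickly the type B point can be found in each version.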

Color and shape

  • Encoding multiple channels in the same visualization can overtax the viewer when the multiple channels are not highlighting some structure in the data

See the two plots on the next slides.

Color and shape – overtaxing

Color and shape

Encoding channels

Other encoding channels that are even less easily processed than shape are:

  • size
  • elongation
  • angle
  • movement

Gestalt Principles

  • Humans are extremely good at finding patterns, even when no pattern exists
  • Our ability to perceive different shapes and lines as a unified image, rather than as a collection of unrelated elements, follows what we call Gestalt principles (Healy, 2018).

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are close to each other – i.e., Proximity principle:

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are similar in shape, color and size – i.e., Similarity principle:

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are visually tied – i.e., Connection principle:

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are perceived as being part of paths, lines, or curves, even when some elements are “hidden” – i.e., Continuity principle:

Gestalt Principles

Visual elements are perceived as complete shapes even when they are not complete; we construct the incomplete form into familiar shapes – i.e., Closure principle