Data Visualization – principles

Why is data visualization important?

Anscombe’s quartet highlights the importance of data visualization and warns against relying solely on summary statistics for data analysis:

library(tidyverse)
anscombe <- read_csv("data/anscombe.csv")

What are the mean and sd of x and y for each group?
Plot a scatter plot of x and y for each group

Mean and sd per group

anscombe |>
  group_by(group) |>
  summarize(mean_x = mean(x), 
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y))

group	mean_x	sd_x	mean_y	sd_y
1	9	3.316625	7.500909	2.031568
2	9	3.316625	7.500909	2.031657
3	9	3.316625	7.500000	2.030424
4	9	3.316625	7.500909	2.030578

Plots per group

anscombe |>
  ggplot(aes(x = x, y = y)) +
  geom_point(size = 3) +
  facet_wrap(~group)
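
The code above reads the quartet from a local CSV file. As a hedged alternative, the sketch below rebuilds the same long-format table from the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`) and adds one more summary statistic, the correlation between x and y per group. The names `anscombe_long`, `anscombe_cor`, and `cor_xy` are illustrative assumptions, not part of the original slides.

```r
library(tidyverse)

# Base R's built-in quartet is wide: x1..x4 and y1..y4.
# Split each column name into the variable (x or y) and the group (1 to 4).
anscombe_long <- datasets::anscombe |>
  pivot_longer(everything(),
               names_to = c(".value", "group"),
               names_pattern = "(.)(.)")

# Correlation between x and y within each group
anscombe_cor <- anscombe_long |>
  group_by(group) |>
  summarize(cor_xy = cor(x, y))

anscombe_cor
```

The four correlations come out nearly identical (about 0.82), even though the four scatter plots look nothing alike.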

Datasaurus

Download the datasaurus data set and set up your working environment.

library(tidyverse)
datasaurus <- read_csv("data/datasaurus_dozen.csv")

What is the mean and standard deviation of x and y by dataset group?
Create a scatterplot for all dataset groups.

Descriptive statistics

datasaurus |> 
  group_by(dataset) |>
  summarize(mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y))

dataset	mean_x	sd_x	mean_y	sd_y
away	54.26610	16.76983	47.83472	26.93974
bullseye	54.26873	16.76924	47.83082	26.93573
circle	54.26732	16.76001	47.83772	26.93004
dino	54.26327	16.76514	47.83225	26.93540
dots	54.26030	16.76774	47.83983	26.93019
h_lines	54.26144	16.76590	47.83025	26.93988
high_lines	54.26881	16.76670	47.83545	26.94000
slant_down	54.26785	16.76676	47.83590	26.93610
slant_up	54.26588	16.76885	47.83150	26.93861
star	54.26734	16.76896	47.83955	26.93027
v_lines	54.26993	16.76996	47.83699	26.93768
wide_lines	54.26692	16.77000	47.83160	26.93790
x_shape	54.26015	16.76996	47.83972	26.93000

Scatterplots

datasaurus |> 
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~dataset)

Visual Perception

Edges, contrasts, and colors

Our visual system (i.e., all the visual processing modules from our eyes to our brain) reconstructs what we are looking at
Our visual system uses relative differences instead of absolute values for brightness and colors
This relativization of brightness is demonstrated in the image on the next slide, in which we perceive darker blobs where grid lines meet at the edge of our visual field

Hermann grid effect

At the very center of the visual field, we see the lines as they are (just white).
The center is where our vision is really good
Our ability to distinguish details decreases at the edges of our visual field (Ware, 2008)
Visual illusions (like the image on the previous slide) are side effects of our visual system processing.

Contrast

Another side effect of our visual system processing is perceiving the same shade as lighter or darker depending on the background behind it
In the image on the next slide, the smaller squares are all the exact same shade of gray
We perceive the gray to be lighter on darker backgrounds and darker on lighter backgrounds
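
A minimal sketch of how such a contrast image could be drawn with ggplot2, assuming five background shades chosen purely for illustration. The names `backgrounds` and `p_contrast` are hypothetical; every inner square is given the same fill (`"gray50"`), so any difference you see between them comes from the background.

```r
library(tidyverse)

# Five background shades from dark to light (illustrative values)
backgrounds <- tibble(
  x     = 1:5,
  shade = gray(seq(0.1, 0.9, length.out = 5))
)

p_contrast <- ggplot(backgrounds, aes(x = x, y = 1)) +
  # large tiles: the varying backgrounds
  geom_tile(aes(fill = shade), width = 1, height = 1) +
  # small tiles: identical mid-gray squares on top of each background
  geom_tile(fill = "gray50", width = 0.3, height = 0.3) +
  scale_fill_identity() +
  coord_fixed() +
  theme_void()

p_contrast
```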

Contrast

Edges, colors, and contrasts

Our visual system is not a physical light meter; what we perceive differs from the actual values of color and brightness
Humans are better at seeing edge contrasts for monochrome images
We are also better at distinguishing dark shades (compared to light shades)

Color

Let’s start with some definitions regarding color. “Color” refers to three components:

Hue refers to the dominant wavelength of the color (e.g., red, blue, purple)
Chroma or saturation refers to how intense or vivid the color is
Luminance refers to how bright a color is
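
To make the three components concrete, the sketch below varies one of them at a time using base R's `hcl()` function, which takes a hue angle, a chroma, and a luminance. Note that this is the hue-chroma-luminance model, swapped in because it matches the three terms defined above; the specific numbers and the names `swatches` and `p_swatches` are illustrative assumptions.

```r
library(tidyverse)

# hcl() takes hue (angle in degrees), chroma, and luminance
swatches <- tibble(
  component = rep(c("hue", "chroma", "luminance"), each = 3),
  step      = rep(1:3, times = 3),
  color     = c(
    hcl(h = c(0, 120, 240), c = 60, l = 65),   # only hue changes
    hcl(h = 240, c = c(10, 40, 70), l = 65),   # only chroma changes
    hcl(h = 240, c = 40, l = c(30, 60, 90))    # only luminance changes
  )
)

# One row of swatches per component, filled with the literal color values
p_swatches <- ggplot(swatches, aes(x = step, y = component, fill = color)) +
  geom_tile(color = "white") +
  scale_fill_identity() +
  theme_minimal()

p_swatches
```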

HSL

Color

Color schemes need to accurately represent differences in our data.
For example, changes from one level to the next need to be perceived as having the same magnitude
We should be careful when picking colors for our variables in our data visualizations

Order these colors

Color

Gradients should be used as sequential scales from low to high to represent numeric continuous/ordered variables (e.g., population size). Light colors should represent low values, and dark colors high values.
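
A minimal sketch of a sequential scale in ggplot2, assuming the `faithfuld` data set that ships with ggplot2 as a stand-in for any continuous variable. The two hex colors and the name `p_seq` are illustrative choices, not prescribed by the slides.

```r
library(tidyverse)

# faithfuld ships with ggplot2: a grid of waiting/eruption times
# with an estimated density at each grid cell
p_seq <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  # light color for low values, dark color for high values
  scale_fill_gradient(low = "#deebf7", high = "#08306b")

p_seq
```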

Color

Diverging scales should be used where there is a neutral midpoint and values vary in either direction from that neutral point (e.g., temperatures can be positive and negative).
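
A hedged sketch of a diverging scale, using correlations between the columns of the built-in `mtcars` data set as a swapped-in example (correlations range from -1 to 1 with a neutral midpoint at 0). The names `cor_long`, `var1`, `var2`, `r`, and `p_div`, and the chosen colors, are assumptions for illustration.

```r
library(tidyverse)

# Pairwise correlations of mtcars columns, reshaped to one row per pair
cor_long <- cor(mtcars) |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "r")

p_div <- ggplot(cor_long, aes(x = var1, y = var2, fill = r)) +
  geom_tile() +
  # two hues diverge from a neutral white at the midpoint 0
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1))

p_div
```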

Color

Qualitative palettes where each color has the same valence (i.e., no color dominates the other) should be used for unordered categorical variables (e.g., political parties, countries). Different colors in the palette should not imply differences in magnitude.
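
A minimal sketch of a qualitative palette, assuming the built-in `iris` data set, whose `Species` column is an unordered categorical variable. The ColorBrewer palette `"Dark2"` is one possible choice among several; `p_qual` is an illustrative name.

```r
library(tidyverse)

# Species is unordered, so each level gets a distinct hue
# with no implied ranking between them
p_qual <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 2) +
  scale_color_brewer(palette = "Dark2")

p_qual
```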

Color

We also need to take into account people who are color-blind, and how they perceive colors
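
One possible way to do this in R is to pick a palette designed with color-vision deficiencies in mind. The sketch below assumes R 4.0.0 or later, where `palette.colors()` provides the Okabe-Ito palette, and uses the `mpg` data set from ggplot2 as a stand-in. The names `okabe_ito` and `p_cb` are illustrative; ggplot2's viridis scales (e.g., `scale_color_viridis_d()`) are another option.

```r
library(tidyverse)

# Okabe-Ito colors from base R; drop the names so that
# scale_color_manual() assigns colors by position rather than by name
okabe_ito <- unname(palette.colors(palette = "Okabe-Ito"))

p_cb <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point(size = 2) +
  # skip the first entry (black) and use the next three colors
  scale_color_manual(values = okabe_ito[2:4])

p_cb
```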

Shape

We can also use shape as a channel to encode information
The color channel, however, is more salient, and shape can be used as a secondary channel.

Visual Search and Pop-out Effect

Try to find the type B data point in the three plots on the next slides

Visual Search and Pop-out Effect

Try to find the type B

Visual Search and Pop-out Effect

Try to find the type B

Visual Search and Pop-out Effect

Try to find the type B

Color and shape

The type B data point should have been easier to find in the first and last plots, and harder in the middle plot.
Color is a channel that is easier to process than shape
Combining both shape and color is at times a good strategy.
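
A hedged sketch of how a pop-out plot like these could be built, using simulated data (the points, the seed, and the names `popout` and `p_popout` are all assumptions for illustration). Here color and shape encode the same variable, so the single type B point is marked redundantly.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed so the simulated layout is reproducible

# 49 points of type A and a single point of type B at random positions
popout <- tibble(
  x    = runif(50),
  y    = runif(50),
  type = c(rep("A", 49), "B")
)

# Redundant encoding: both color and shape mark the type
p_popout <- ggplot(popout, aes(x = x, y = y, color = type, shape = type)) +
  geom_point(size = 3)

p_popout
```

Dropping either `color = type` or `shape = type` from `aes()` gives the single-channel versions for comparison.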

Color and shape

Encoding multiple channels in the same visualization can overtax the viewer when the multiple channels are not highlighting some structure in the data

See the two plots on the next slides

Color and shape – overtaxing

Color and shape

Encoding channels

Other encoding channels that are even less easily processed than shape are:

size
elongation
angle
movement

Gestalt Principles

Humans are extremely good at finding patterns, even when no pattern exists
The ability to perceive different shapes and lines as a unified image, rather than as a collection of unrelated elements, follows what we call Gestalt principles (Healy, 2018).

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are close to each other – i.e., Proximity principle:

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are similar in shape, color and size – i.e., Similarity principle:

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are visually tied – i.e., Connection principle:

Gestalt Principles

Visual elements are perceived as being related to each other when these elements are perceived as being part of paths, lines, or curves, even when some elements are “hidden” – i.e., Continuity principle:

Gestalt Principles

Visual elements are perceived as whole objects even when they are not complete; we construct the incomplete form into familiar shapes – i.e., Closure principle