Question 1 - Scales

The specs for the two plots above have just one property that is different between them. Complete the spec below for both plots.

Plot 1

                    
                    scales: [
                                {
                                    name: "xScale",
                                    domain: [2014, 2018],
                                    range: "width",
                                    zero: false
                                },
                                {
                                    name: "yScale",
                                    domain: { data: "aggregated", field: "m" },
                                    range: "height",
                                    zero: true
                                }
                            ]
                    

Plot 2

                    
                    scales: [
                                {
                                    name: "xScale",
                                    domain: [2014, 2018],
                                    range: "width",
                                    zero: false
                                },
                                {
                                    name: "yScale",
                                    domain: { data: "aggregated", field: "m" },
                                    range: "height",
                                    zero: false
                                }
                            ]
                    

What are the pros and cons of each scale? Which scale reflects the data more accurately in your opinion?

The first plot shows the scale starting at zero for dollars (y) which makes the differences in price point less drastic but more realistic. If you want to make the point that avocados were more expensive in 2017, you'd use the non-zero scale (plot 2).

Question 2 - Bar Plots

Question 2a - ordering unordered categorical variables

The specs for the two plots above have just one property that is different between them. Complete the spec below for plot 2 to sort the bars.

Plot 2

                    
                        ...
                        data: [
                                {
                                name: "lego_themes",
                                url: "https://raw.githubusercontent.com/picoral/csc-444-data/main/lego_themes.json",
                                transform: [
                                    {
                                      type: "collect",
                                      sort: { field: "n", order: "descending" }
                                    }
                                ]
                            }
                        ],
                        scales: [
                            {
                                name: "xScale",
                                type: "band",
                                domain: { data: "lego_themes", field: "theme"},
                                range: "width",
                                padding: 0.6
                            },
                            {
                                name: "yScale",
                                type: "linear",
                                domain: { data: "lego_themes", field: "n" },
                                range: "height"
                            }
                        ],
                        ...
                    

Why is sorting bars by the numeric variable recommended in this case?

Theme is an unordered categorical variable, so there's no reason to order them by alphabetical order. By ordering them by the numeric variable, it's much easier to identify the categories with the highest and lowest counts.

Question 2b - Labeling bars

The specs for the two plots above have just one property that is different between them. Complete the spec below for plot 2 to add values to the bars.

Plot 2

                    
                        ...
                        marks: [
                            {
                                type: "rect",
                                from: { data: "lego_themes" },
                                encode: {
                                    enter: {
                                        x: { field: "theme", scale: "xScale"},
                                        y2: { value: 0, scale: "yScale"},
                                        y: { field: "n", scale: "yScale"},
                                        width: { value: 30}
                                    }
                                }
                            },
                            {
                                type: "text",
                                from: { data: "lego_themes" },
                                encode: {
                                    enter: {
                                        x: { field: "theme", scale: "xScale" },
                                        y: { field: "n", scale: "yScale" },
                                        text: { field: "n"},
                                        dy: { value: 10 },
                                        dx: { value: 5 },
                                        fill: { value: "white" }
                                    }
                                }
                            }
                        ],
                        ...
                    

List one advantage and one disadvantage of adding values to bars in a bar plot.

One advantage is that the duplicate information makes the values more obvious, but the disadvantage is the duplication of information (not maximizing data to ink).

Question 3 - Stacked Bar Plots

Question 3a - labeling and ordering unordered categorical variables

The specs for the two plots above have just two properties that are different between them. Complete the spec below for plot 2 to sort the bars, and add a legend. You can look at the data and the variables you have available.

Plot 2

                        
        // create spec
        var spec = { 
            $schema: "https://vega.github.io/schema/vega/v5.json",
            description: "A bar plot of Artists in the USA",
            width: 800,
            height: 500,
            padding: 50,
            autosize: "fit",
            data: [
                {
                    name: "artists",
                    url: "https://raw.githubusercontent.com/picoral/csc-444-data/main/artists.json",
                    transform: [
                        {
                            type: "stack",
                            groupby: ["type"],
                            field: "percent",
                            as: ["x0", "x1"],
                            sort: { field: "race" }
                        },
                        {
                            type: "collect",
                            sort: { field: "max" }
                        }
                    ]
                }
            ],
            scales: [
            {
                name: "xScale",
                type: "linear",
                domain: [0, 1],
                range: "width"
            },
            {
                name: "yScale",
                type: "band",
                domain: { data: "artists", field: "type"},
                range: "height",
                padding: .5
            },
            {
                name: "colorScale",
                type: "ordinal",
                domain: { data: "artists", field: "race" },
                range: {"scheme": "category20"}
            }
            ],
            axes: [
            {
                scale: "xScale",
                orient: "bottom"
            },
            {
                scale: "yScale",
                orient: "left"
            }
            ],
            marks: [
            {
                type: "rect",
                from: { data: "artists" },
                encode: {
                enter: {
                    x: { field: "x0", scale: "xScale"},
                    x2: { field: "x1", scale: "xScale"},
                    y: { field: "type", scale: "yScale" },
                    height: { value: 20 },
                    fill: { field: "race", scale: "colorScale"}
                }
                }
            }
            ],
            legends: [
              {
                fill: "colorScale"
              }
            ]
        };
                        

Explain why the sorted version is better.

There's a lot of information in this plot. Comparing bar lengths is not easy, by sorting by the longest bar at least we know which artist categories have the least and most percentage of white people.

Question 3b - Representing percentages

Why is plot 1 preferable over plot 2?

The bars in plot are too skinny, they are hard to read, and to compare heights (if ordering race by value, each group would have a different order, which would be more confusing), Since the percentages sum up to 1, plot 1 makes more sense because then the bars across artist types are all the same lenght, so we can focus on the proportions for race better.

Question 4 - Scatterplots

Question 4a - Filtering Data

Identify the differences between the two plots above. What are they? What would the differences in the spec be? (you can use prose to answer this) You can look at the data and the variables you have available.

The scale for the first plot starts at (0, 0) and there are a number of data points with zero average price and zero total weekly income. In the second plot, the data has been filtered to eliminate the zero values and also to eliminate average ticket price that is higher than 100 dollars.
A data transform of type "filter" and using an expr: "datum.avg_ticket_price < 100 & datum.avg_ticket_price > 0" was used in the spec for the second plot.

Question 4b - Color in Scatterplots

Identify the differences between the two plots above. What are they? What would the differences in the spec be? (you can use prose to answer this) You can look at the data and the variables you have available.

The first plot has average price mapped to the y axis, and total weekly gross to the x axis. That shows that as the average ticket price increases so does total weekly gross income.
Two plays are shows -- a filter was applied to the data with the expression "datum.show == 'The Phantom of the Opera' || datum.show == 'The Lion King'" -- that is the case for both plots.
The first plot does show that the average price ticket and total weekly gross in dollars for The Lion King are higher compared to The Phantom of the Opera.
The second plot shows total weekly gross across week number. Again we see that the total weekly gross for The Lion King is higher, and that is the case across all weeks the plays were on Broadway.
There's also an increase in total weekly gross between weeks 30 and 35 for both plays.

Question 5 - Aggregating Data

Complete the spec below for plot 2 to aggregate the data for sum for every week in every year each show was on Broadway. You can look at the data and the variables you have available.

Plot 2

                        
                        transform: [
                            {
                                type: "aggregate",
                                groupby: ["week_number", "show"],
                                fields: [weekly_gross],
                                ops: [sum],
                                as: ["total_gross"]
                            }
                        ]
                    

Identify the differences between the two plots above. What are they? What would the differences in the spec be? (you can use prose to answer this) You can look at the data and the variables you have available.

The first plot shows an observation per week per show in multiple years that the show was on Broadway, with x mapped to week number in the season/year and y mapped to total weekly gross in US dollars.
The second plot has the same variables mapped to the x and y, but the difference is that instead of having multiple observations per week number for each show, the observations were summed up for all years.