Class 7

Quarto, Tidy Data, Pivoting

Materials for class on

2024-09-17

Preparation Materials

Agenda

Today we’ll focus on:

  • Quick review from last class (source now revealed on class page)
  • Homework 1 pre-survey
  • Quarto html documents
  • Tidy data - focus on pivoting
  • In-class exercises

Homework 1

For homework 1, we’ll be analyzing our own class data on some linguistic variation in English. To prepare for that, we need to collect the data. One of the assignments for next class is to complete the survey!

Survey link

Tidy Data and Reshaping Functions

Stylized text providing an overview of Tidy Data. The top reads “Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.” On the left reads “In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.” There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure.

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Pivoting Data

Pivoting data in R refers to changing between “wide” and “long” formats. Long formats are considered “tidy”, but sometimes we want to analyze or present data in wide format as well.

One helpful thing to remember about the pivoting functions is that generally you pivot wider “from”, and pivot longer “to”. Any good mnemonics for that?

Some data that often are in wide format are grades. Let’s make some up. You can copy/paste this code to get the same data.

library(tidyverse)
# setting a "seed" is a way of reproducing the same "random" samples
set.seed(343) 
df_grades <- tibble(
  ID = as.factor(seq(1:50)),
  section = as.factor(rep(1:5, 10)),
  quiz1 = sample(1:100, 50, replace = TRUE),
  quiz2 = sample(1:100, 50, replace = TRUE),
  quiz3 = sample(1:100, 50, replace = TRUE)
) |> 
  arrange(section)
df_grades
#> # A tibble: 50 × 5
#>    ID    section quiz1 quiz2 quiz3
#>    <fct> <fct>   <int> <int> <int>
#>  1 1     1          77    41    69
#>  2 6     1          38    20    78
#>  3 11    1          52    42    63
#>  4 16    1          95    45    25
#>  5 21    1          12    77    59
#>  6 26    1           6    12     5
#>  7 31    1          88     5    87
#>  8 36    1          71    39    61
#>  9 41    1          71    97    99
#> 10 46    1          58     7    20
#> # ℹ 40 more rows

Here we have the grades wide - each quiz has its own column. This is easy to get a summary of the overall class average for each quiz, or to group by section:

df_grades |> 
  summarize(q1 = mean(quiz1, na.rm = TRUE),
            q2 = mean(quiz2, na.rm = TRUE),
            q3 = mean(quiz3, na.rm = TRUE))
#> # A tibble: 1 × 3
#>      q1    q2    q3
#>   <dbl> <dbl> <dbl>
#> 1  42.2  45.4  52.7
df_grades |> 
  group_by(section) |> 
    summarize(q1 = mean(quiz1, na.rm = TRUE),
              q2 = mean(quiz2, na.rm = TRUE),
              q3 = mean(quiz3, na.rm = TRUE))
#> # A tibble: 5 × 4
#>   section    q1    q2    q3
#>   <fct>   <dbl> <dbl> <dbl>
#> 1 1        56.8  38.5  56.6
#> 2 2        25.3  40.3  52.1
#> 3 3        31.4  56.1  59.4
#> 4 4        47.7  39.5  54.2
#> 5 5        50    52.5  41.4

However, it is not easy to plot this data because the scores are all in separate variables (and cannot be part of the same aesthetic). To get those quiz scores together in tidy format, we can pivot longer:

df_grades_long <- pivot_longer(
  df_grades, 
  cols = c(quiz1, quiz2, quiz3), 
  names_to = "quiz", 
  values_to = "score") 
df_grades_long |> 
  ggplot(aes(x = quiz, y = score, color = ID, group = ID)) +
  geom_line()

This plot is a bit silly because the grades are random, but the point is now we can plot it.

We could also plot the scores by section means:

df_grades_long |> 
  group_by(section, quiz) |> 
  summarize(mean = mean(score, na.rm = TRUE)) |> 
  ggplot(aes(x = quiz, y = mean, group = section, color = section)) +
  geom_line()
#> `summarise()` has grouped output by 'section'. You can override using the
#> `.groups` argument.

If we want it to go wide again, we can use pivot_wider():

df_grades_wide <- df_grades_long |> 
  pivot_wider(names_from = quiz, values_from = score)
df_grades_wide
#> # A tibble: 50 × 5
#>    ID    section quiz1 quiz2 quiz3
#>    <fct> <fct>   <int> <int> <int>
#>  1 1     1          77    41    69
#>  2 6     1          38    20    78
#>  3 11    1          52    42    63
#>  4 16    1          95    45    25
#>  5 21    1          12    77    59
#>  6 26    1           6    12     5
#>  7 31    1          88     5    87
#>  8 36    1          71    39    61
#>  9 41    1          71    97    99
#> 10 46    1          58     7    20
#> # ℹ 40 more rows

In-class activity

Let’s work together in our Quarto document on an in-class engagement assignment (instead of iClicker points). At the end, you’ll submit your file to Gradescope.

Here is a copy of the questions/tasks:

With the babynames dataset:

  1. Pivot the data so that there is only one row per name per year, with the values corresponding to sex “M” and sex “F” across the same row. (Make it less tidy!)
  2. Create a new column with the total ‘n’ for the year across both sexes. (you may want to go back to the Posit Primers for this!)
  3. Choose 5 names and plot lines for their ‘n’ across all years total across both sexes, so that each line has a different color.
  4. Change the colors to something other than the defaults.
  5. Now choose one name and make a plot with separate lines (different colors) for ‘n’ with sex “M” and ‘n’ with sex “F”. You have to figure out what shape your data should be in to do this - tidy or untidy? How do you get those variables to align with different colors?
  6. Pivot the original babynames dataframe to have separate rows for ‘n’ and ‘proportion’ data.
  7. Why can’t you pivot ‘sex’ together with ‘n’ and ‘prop’?
  8. Where did you observe ‘NA’ values? Why do you think they appeared?