Class 26

Collecting Data, and a Smidgen of Statistics and ML

Materials for class on

2024-11-26

Agenda

Today we’ll focus on:

the importance of data visualization
data collection tools and pointers
Just enough statistics to be dangerous
a hint towards ML strategies

The Importance of Data Visualization

You might think from reading research articles that statistics is all that really matters, and visualization is just for fun human engagement. It’s not just a pretty way to look at summary or even inferential statistics though! Data visualization shows things that do not “appear” in summaries. The most famous illustration of this is Anscombe’s Quartet, which you can explore with the {anscombe} package. There’s a more amusing version that has garnered attention, through Alberto Cairo’s tweeted “datasaurus”. This concept is now realized in the {datasauRus} package.

library(datasauRus)
library(tidyverse)

You can see that the summary statistics for the multiple datasets are nearly identical:

datasaurus_dozen  |>  
    group_by(dataset) |>  
    summarize(
      mean_x    = mean(x),
      mean_y    = mean(y),
      std_dev_x = sd(x),
      std_dev_y = sd(y),
      corr_x_y  = cor(x, y)
    )

#> # A tibble: 13 × 6
#>    dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
#>    <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#>  1 away         54.3   47.8      16.8      26.9  -0.0641
#>  2 bullseye     54.3   47.8      16.8      26.9  -0.0686
#>  3 circle       54.3   47.8      16.8      26.9  -0.0683
#>  4 dino         54.3   47.8      16.8      26.9  -0.0645
#>  5 dots         54.3   47.8      16.8      26.9  -0.0603
#>  6 h_lines      54.3   47.8      16.8      26.9  -0.0617
#>  7 high_lines   54.3   47.8      16.8      26.9  -0.0685
#>  8 slant_down   54.3   47.8      16.8      26.9  -0.0690
#>  9 slant_up     54.3   47.8      16.8      26.9  -0.0686
#> 10 star         54.3   47.8      16.8      26.9  -0.0630
#> 11 v_lines      54.3   47.8      16.8      26.9  -0.0694
#> 12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
#> 13 x_shape      54.3   47.8      16.8      26.9  -0.0656

However the underlying data are very different!

  ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset))+
    geom_point()+
    theme_void()+
    theme(legend.position = "none")+
    facet_wrap(~dataset, ncol = 4)

Why Inferential Statistics?

Let’s look at the data from the Husband 2022 paper.

df_all <- read.csv(here("classes", "_data", "delong maze 40Ss.csv"),
  header = 1, sep = ",", comment.char = "#", strip.white = T,
  col.names = c("Index", "Time", "Counter", "Hash", "Owner", 
                "Controller", "Item", "Element", "Type", "Group", 
                "FieldName", "Value", "WordNum", "Word", "Alt", 
                "WordOn", "CorrWord", "RT", "Sent", "TotalTime", 
                "Question", "Resp", "Acc", "RespRT")
)

Here I’ll just give you an example of translating some of the code from the original analysis into tidyverse style:

df_rt <- df_all |> 
  filter(Controller == "Maze" & !str_detect(Type, "prac")) |> 
  select(1:10, 13:20) |> 
  separate(col = Type, 
           into = c("exp", "item", "expect", "position", "pos", 
                    "cloze", "art.cloze", "n.cloze"), 
           sep = "\\.", convert = TRUE, fill = "right") |> 
  mutate(WordNum = as.numeric(WordNum),
         Acc = as.numeric(as.character(recode(CorrWord, yes = "1", no = "0"))),
         n.cloze.scale =  scale(n.cloze), 
         art.cloze.scale = scale(art.cloze)) |> 
  mutate(across(where(is.character), as.factor)) |> 
  filter(item != 29) |> 
  filter(Hash != "9dAvrH0+R6a0U5adPzZSyA")

rt.s.filt <- df_rt |> 
  mutate(rgn.fix = WordNum - pos + 1,
         word.num.z = scale(WordNum),
         word.len = nchar(as.character(Word)),
         Altword.len = nchar(as.character(Alt)),
         expect = `contrasts<-`(expect,TRUE, c(-.5,.5)),
         item.expect = paste(item, expect, sep=".")
       ) |> 
  filter(!Hash %in% c("gyxidIf0fqXBM7nxg2K7SQ","f8dC3CkleTBP9lUufzUOyQ"))

rgn.rt.raw <- rt.s.filt %>%
  filter(rgn.fix > -4 & rgn.fix < 5) %>%
  filter(Acc == 1) %>%
  group_by(rgn.fix, expect) %>%
  summarize(n = n(), subj = length(unique(Hash)), rt = mean(RT), 
            sd = sd(RT), stderr = sd / sqrt(subj)) %>%
  as.data.frame()

rgn.rt.raw$rgn <- as.factor(recode(rgn.rt.raw$rgn.fix, "-3"="CW-3", "-2"="CW-2", "-1"="CW-1", "0"="art", "1"="n","2"="CW+1", "3"="CW+2", "4"="CW+3"))
rgn.rt.raw$rgn <- ordered(rgn.rt.raw$rgn, levels = c("CW-3", "CW-2", "CW-1", "art", "n", "CW+1", "CW+2", "CW+3"))

Zooming in on the article responses, and looking at the grand averaged data :

rt.s.filt |> 
  filter(rgn.fix == 0) |> 
  group_by(Hash, expect) |> 
  summarize(RT = mean(RT, na.rm = TRUE)) |> 
  group_by(expect) |> 
  summarize(RT = mean(RT, na.rm = TRUE)) |>
  gt() |> 
  fmt_number(decimals = 0)

expect	RT
expected	672
unexpected	717

The unexpected articles took longer than the expected ones, judging by the mean. But it’s only by 45 milliseconds. How much variance is there underneath that mean? Look at the individual points, representing participants.

set.seed(343)
p_parts <- rt.s.filt |> 
  filter(rgn.fix == 0) |> 
  group_by(Hash, expect) |> 
  summarize(RT = mean(RT, na.rm = TRUE)) |> 
  ggplot(aes(x=expect, y=RT, color = expect)) +
  geom_jitter(stat = "identity", width = .1, alpha = .8) +
  geom_point(stat = "summary", fun = mean, 
             shape = 4, color = "blue", size = 4) +
  labs(x = "Condition", y = "Reading Time (msec)") + 
  theme_bw()  
ggplotly(p_parts)

Let’s look at whether individual participants generally show the effect by looking at their slopes. Do they all go up to the right?

rt.s.filt |> 
  filter(rgn.fix == 0) |> 
  group_by(Hash, expect) |> 
  summarize(RT = mean(RT, na.rm = TRUE)) |> 
  ggplot(aes(x=expect, y=RT, shape = expect, group = Hash, color = Hash)) +
  geom_line() +
  geom_point(stat = "identity", alpha = .8, size = 2) +
  labs(x = "Condition", y = "Reading Time (msec)") + 
  theme_minimal() +
  theme(legend.position = "none")

There’s a mix of ups and downs, and flat lines.

Poll

Would you guess from looking at all of these data that this effect would generalize?

Thankfully we don’t have to just guess!

How likely is it that we would observe a difference at least this large (45 ms) if the null hypothesis were true - in this case, assuming the null hypothesis is that there is no reaction time difference between expected and unexpected articles? This is what a p-value measures. A p-value of .05 indicates that there is a 5% chance of observing an effect of at least this size even if the null hypothesis were true.

How to do Inferential Statistics

Warning

I’m showing you how to write the code for some statistical tests. This does not grant you a license to practice statistics, and does not provide you with any of the necessary background to know which tests to run and how to interpret them! Proceed with caution.

Preparing for Statistical Tests

In order to know which statistical tests are appropriate for your analysis, you need some statistical knowledge to work with the following important information.

Your experimental design

you should have a clear idea of what hypothesis is being tested
statistical testing should be planned as part of the experimental design
you should be able to write your statistical analysis before collecting any data!

The nature of your data

statistical tests are based on assumptions both about your design, AND your data
to confirm that your chosen statistical tests are appropriate, you need to check for assumptions
some can be checked with statistical “pre-tests”, some with visualizations

Choosing a statistical test is much more complex than following a decision tree or flowchart. It requires careful consideration of your theory, design, and data.

Preparing Data for Statistical Tests

The shape of your data will depend to some extent on what tests you are doing, but some general guidelines beyond the initial data cleaning and wrangling:

check that variables are named correctly so you’re testing what you think you’re testing!
model/run tests on unsummarized data, unless pre-summarization is a specific step in the analysis type you are conducting
check that all variables are of the right type
- make sure that categorical variables are factors (or character), not numeric
- confirm dates are in the needed format
- confirm continuous numbers are the right numeric type
check that the levels of factors are in the correct order and that contrasts are set as desired
confirm units are consistent/as expected (temps in C or F? weight in kg or lb? population in thousands or millions?)
evaluate missing data and adjust if needed
transform, standardize, scale, and/or center variables as required

Coding Tests

We’re going to take the assumptions from the original Husband 2022 paper for our purposes, but simplify some of the analysis using a linear mixed effects model, like so:

library(lme4) # package for linear mixed effects
library(lmerTest) # package for p-values from lme4 models

A basic linear model would be generated using the lm() function:

m_lm <- lm(RT ~ expect, 
           data = filter(rt.s.filt, rgn.fix == 0))
summary(m_lm)

#> 
#> Call:
#> lm(formula = RT ~ expect, data = filter(rt.s.filt, rgn.fix == 
#>     0))
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -603.8 -159.8  -66.9   62.4 3964.2 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  694.419      5.423 128.041  < 2e-16 ***
#> expect1       44.803     10.847   4.131 3.72e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 289.2 on 2842 degrees of freedom
#> Multiple R-squared:  0.005968,   Adjusted R-squared:  0.005618 
#> F-statistic: 17.06 on 1 and 2842 DF,  p-value: 3.723e-05

The p-value is in the Pr(>|t|) column on the line with the fixed effect of expect1 (the effect of the “expected” level compared to the “unexpected” level). This is the probability of the t-statistic. Here is a simplified table output:

tbl_regression(m_lm, conf.int = FALSE)

Characteristic	Beta	p-value
expect
expected	45	<0.001

Closer to what is usually used in linguistic experimental work, we could also model the groupings of participants and items inherent in the data using a mixed effects model with lmer():

m_mixed <- lmer(RT ~ expect + (1 | Hash) + (1 + expect | item), 
           data = filter(rt.s.filt, rgn.fix == 0))
summary(m_mixed)

#> Linear mixed model fit by REML. t-tests use Satterthwaite's method [
#> lmerModLmerTest]
#> Formula: RT ~ expect + (1 | Hash) + (1 + expect | item)
#>    Data: filter(rt.s.filt, rgn.fix == 0)
#> 
#> REML criterion at convergence: 39573.4
#> 
#> Scaled residuals: 
#>     Min      1Q  Median      3Q     Max 
#> -2.1107 -0.4872 -0.1664  0.2535 13.9999 
#> 
#> Random effects:
#>  Groups   Name        Variance Std.Dev. Corr
#>  item     (Intercept)   908.5   30.14       
#>           expect1      3263.9   57.13   1.00
#>  Hash     (Intercept) 21297.0  145.93       
#>  Residual             61152.7  247.29       
#> Number of obs: 2844, groups:  item, 79; Hash, 36
#> 
#> Fixed effects:
#>             Estimate Std. Error     df t value Pr(>|t|)    
#> (Intercept)   694.17      24.99  36.29  27.774  < 2e-16 ***
#> expect1        43.99      11.30  91.85   3.893 0.000187 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Correlation of Fixed Effects:
#>         (Intr)
#> expect1 0.078 
#> optimizer (nloptwrap) convergence code: 0 (OK)
#> boundary (singular) fit: see help('isSingular')

tbl_regression(m_mixed, conf.int = FALSE)

Characteristic	Beta	p-value
expect
expected	44	<0.001

You can read about the parallels between linear regression and other statistical tests from this site by Jonas Kristoffer Lindeløv.

What the lmer() version adds with “mixed effects” is the ability to model the groupings in the data over participants and items in the same model, as crossed effects. These are described as “random effects” and with lmer() have the syntax (1|random_effect_variable) (similar to the Error in aov()). Random slopes can be added as well as `(1 + random_slope | random_intercept). This is a very complex statistical topic that I will not even attempt to cover, but we can look at the same kind of model for the noun as well:

m_noun <- lmer(RT ~ expect + (1 | Hash) + (1 + expect | item), 
           data = filter(rt.s.filt, rgn.fix == 1))
tbl_regression(m_noun, conf.int = FALSE)

Characteristic	Beta	p-value
expect
expected	363	<0.001

Often including all random slopes leads to errors and/or warnings. This is another very complex topic, but I will just point to the fact that some turn to Bayesian modelling to avoid these computational errors. A relatively easy package to use for Bayesian modelling that is similar to {lme4} is {brms}.

{brms} is a bit complex to setup as it is built on Bayesian analysis tools that are built with C++ code. There are instructions on setup out there, but the short version is it takes more than just installing the package itself. I’m not going to show you the relevant code because it’s not something you can just easily try out, or even interpret. Bayesian modelling has its own foundations and assumptions, and you would need to dedicate some serious time to learning it, beyond learning linear mixed effects modelling.

Next Steps in Statistics

If you don’t know any statistics

Really you should take at least an introductory statistics course to learn the foundations about distributions and probability. It’s easier to do this while you’re in college and already paying tuition! However, if you need to learn it on your own, there are some resources that are linguistics-oriented for the beginner.

How to do Linguistics with R by Natalia Levshina is a good R-based introduction that takes you through the basics of statistics, including analyses like t-tests and ANOVAs. It also covers a very broad range of model types. It is a bit outdated when it comes to the package used for mixed effects modelling (it came out before {lmer4} was released), but gives a good broad foundation, and you can figure out the {lme4} details from other sources. You probably will not want to work through this entire book, but start from the beginning and then learn the types of models that are relevant for the work you are doing.

Analyzing Linguistics Data by R.H. Baayen is also a good text, but it is even more outdated now unfortunately.

The PsyTeachR Analysis book is psychology-focused, but in a way that is very relevant to linguistic research as well. This book goes through statistics basics, like the Levshina book, but is very current and freely available online.

If you know some (general) statistics

If you have taken courses like STATS 250 or even some higher-level statistics courses, you (should) know the foundations of statistics, but likely do not know about the analyses typically used in linguistics. These approaches are usually not taught in general coursework, but rather in linguistics- or psychology-specific courses. For most types of linguistic work, you need to learn linear mixed effects models, and possibly generalized linear mixed effects. There is a ton of jargon overlap and confusion in this area, but these are sometimes also referred to as hierarchical modelling or hierarchical regression. The only UM course I’m aware of at the undergraduate level which covers these models is PSYCH 463, which requires previous stats knowledge (of regression).

Joe Fruehwald’s Statistical Modelling with R Course is a fantastic quick starter on the basics of linear mixed effects models, and what their purpose is. You can quickly read/skim the early chapters on data wrangling and learn more about the statistics from meetings 5-8.

Statistics for Linguistics by Bodo Winter is a great book for quickly getting familiar with (generalized) mixed effects modelling, though it really skims over some foundational statistical concepts. It is probably the most comprehensive “quick start” and what I usually have students in my lab work through, given I can help them with the basics as needed. You may feel a bit unmoored doing analysis on your own if you only read this and still don’t know what a t-test is, or how ANOVAs work. It is available as an e-book through the library.

PsyTeachR’s Learning Statistical Models Through Simulation in R is a great walkthrough from the basic linear model into mixed-effects models, including generalized models.

R for Prediction and Machine Learning

Some of you have inquired about how these skills can be used in a “business” context. Many businesses use experiments (such as A/B testing websites) and surveys which require data cleaning/wrangling and statistical tests similar to what we have been working on. Businesses also frequently want to do predictive modeling, which uses similar methods as inferential statistical tests to predict future datapoints about all kinds of things! Here is a case study using hotel bookings that is demonstrated using the {tidymodels} suite of tools. {tidymodels} helps with running many models and comparing their output.

Any direction you go with data, you will use the foundational principles of data wrangling and report generation that we have discussed in class. What goes in the middle of that sandwich is very domain specific and requires an understanding of the field itself (e.g., the hotel business, or psycholinguistics), the statistical tools relevant for that field, and the most useful technical tools to execute those analyses.

Older episodes (go back to the beginning) of the Not So Standard Deviations podcast have some interesting discussions about data analysis and data science in the context of both academia (primarily from Roger Peng) and business (primarily from Hilary Parker).

There are also a lot of RStudio::conf/Posit::conf talks about different business/industry topics that you can find on YouTube or the Posit website.