Class 25

Cool Stuff

Materials for class on

2024-11-21

Agenda

Today we’ll focus on:

  • Cool Stuff to Know About (maybe for hp2)
  • Rendering Quarto to different outputs
  • Homework 4 Questions (10 minutes)

Cool Stuff (for Linguistics)

This is a sampling of cool things you can do for linguistics in R that we haven’t even gotten to discuss. They all interact with/depend on the basic data wrangling and analysis tools we have been learning though!

udpipe for Universal Dependencies Tools

https://bnosac.github.io/udpipe/en/

{udpipe} provides some very cool language analysis tools. Possibly the most useful function we haven’t talked about in class is part-of-speech tagging, and it also creates dependency parses, which are a representation of intra-sentence relationships that is different from the constituency parses you may be more familiar with. These are based on the Universal Dependencies resource, which has the benefit of coverage for over 100 languages. Constituency-based treebanks are not as widely available.

# install.packages("udpipe")
# install.packages("textplot")
# install.packages("ggraph")
library(udpipe)
library(textplot)

We choose and download the model we want to use:

udmodel <- udpipe_download_model(language = "english")
udmodel <- udpipe_load_model(file = udmodel$file_model)

Then annotate our text:

library(dplyr)      # for select()
library(kableExtra) # for kbl() and kable_styling()

y <- udpipe(x = "I made this dependency parse image using R, yay!", object = "english")
y |> select(-sentence) |> kbl() |> kable_styling()
| doc_id | paragraph_id | sentence_id | start | end | term_id | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doc1 | 1 | 1 | 1 | 1 | 1 | 1 | I | I | PRON | PRP | Case=Nom\|Number=Sing\|Person=1\|PronType=Prs | 2 | nsubj | NA | NA |
| doc1 | 1 | 1 | 3 | 6 | 2 | 2 | made | make | VERB | VBD | Mood=Ind\|Tense=Past\|VerbForm=Fin | 0 | root | NA | NA |
| doc1 | 1 | 1 | 8 | 11 | 3 | 3 | this | this | DET | DT | Number=Sing\|PronType=Dem | 6 | det | NA | NA |
| doc1 | 1 | 1 | 13 | 22 | 4 | 4 | dependency | dependency | NOUN | NN | Number=Sing | 6 | compound | NA | NA |
| doc1 | 1 | 1 | 24 | 28 | 5 | 5 | parse | parse | NOUN | NN | Number=Sing | 6 | compound | NA | NA |
| doc1 | 1 | 1 | 30 | 34 | 6 | 6 | image | image | NOUN | NN | Number=Sing | 2 | obj | NA | NA |
| doc1 | 1 | 1 | 36 | 40 | 7 | 7 | using | use | VERB | VBG | VerbForm=Ger | 2 | xcomp | NA | NA |
| doc1 | 1 | 1 | 42 | 42 | 8 | 8 | R | R | PROPN | NNP | Number=Sing | 7 | obj | NA | SpaceAfter=No |
| doc1 | 1 | 1 | 43 | 43 | 9 | 9 | , | , | PUNCT | , | NA | 2 | punct | NA | NA |
| doc1 | 1 | 1 | 45 | 47 | 10 | 10 | yay | yay | INTJ | UH | NA | 2 | discourse | NA | SpaceAfter=No |
| doc1 | 1 | 1 | 48 | 48 | 11 | 11 | ! | ! | PUNCT | . | NA | 2 | punct | NA | SpacesAfter=\n |

Use the dependency plotter from {textplot}:

textplot_dependencyparser(y)
#> Loading required namespace: ggraph

The same author has created several other related packages: https://bnosac.github.io/udpipe/docs/doc9.html

quanteda for Text Analysis

https://quanteda.io/

{quanteda} is one of the most well-known text analysis packages for R. It actually is a family of packages that can be used for NLP, text management, modelling, statistics, and plotting.
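As a tiny taste of the {quanteda} workflow (a sketch with made-up example text, not from the course materials), you can tokenize documents and build a document-feature matrix of word counts:

```r
library(quanteda)

# Tokenize two toy documents (hypothetical example text)
toks <- tokens(c(d1 = "R makes text analysis fun.",
                 d2 = "Text analysis in R is fun!"),
               remove_punct = TRUE)

# Build a document-feature matrix (word counts per document;
# dfm() lowercases by default)
dfmat <- dfm(toks)
dfmat
```

The document-feature matrix is the starting point for most of quanteda’s statistics and modelling tools.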

readtext (quanteda)

https://readtext.quanteda.io/index.html

The {readtext} package can import text from all sorts of files, including PDFs. See the reference for examples of how to read in a whole folder of PDF files and extract the text! It can do the same for CSV, txt, Microsoft Word, JSON, etc.
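A minimal sketch of reading a whole folder of PDFs (the folder path here is a placeholder for wherever your files live):

```r
library(readtext)

# Read every PDF in a folder; readtext() extracts the text into a
# data frame with one row per file (works for docx, csv, txt, etc. too)
docs <- readtext("data/pdfs/*.pdf")
```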

Language Variation and Change Analysis in R

Tutorial by Matt Hunt Gardner:

https://lingmethodshub.github.io/content/R/lvc_r/010_lvcr.html

lingtypology package

https://github.com/ropensci/lingtypology

{lingtypology} provides tools for searching typology and cartography databases and working with that data.
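For a quick sense of what that looks like, here is a sketch (the language names are just illustrative):

```r
library(lingtypology)

# Look up a language's genealogical affiliation
aff.lang("Korean")

# Plot a set of languages on an interactive map
map.feature(c("Korean", "Basque", "Quechua"))
```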

Praat Helpers

The University of Minnesota ListenLab has gathered some Praat helper scripts: https://github.com/ListenLab/Praat

There are some other R tools for speech in other repos of theirs as well.

Phylogenetic Trees

A tutorial on creating phylogenetic trees by Erich Round:

https://ladal.edu.au/phylo

Working with Sociophonetic Data in R

Workshop by George Bailey: https://www.gbailey.uk/sociophon_workshop/

Plotting Vowels in R

https://lingmethodshub.github.io/content/R/vowel-plots-tutorial/

Various Case Studies from LADAL

https://ladal.edu.au/tutorials.html#6_Case_Studies

Working with Maps

Josef Fruehwald has some great teaching materials, and has some guidance for starting on maps here:

https://jofrhwld.github.io/AandS500_2023/class_notes/2023-02-28/

A JSON API data Tutorial

From Thomas Mock:

https://themockup.blog/posts/2020-12-13-extracting-json-from-websites-and-public-apis-with-r/

plotly

{plotly} is a plotting package that can be used with or without ggplot, based on the JavaScript library plotly.js.

https://plotly.com/r/

It gives tools for interactivity and for other types of plots such as 3D.

Here’s an example of the interactivity:

library(plotly)
g <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") + 
  xlim(1, 6) + ylim(40, 100)
ggplotly(g)

Here’s a 3D plot:

plot_ly(z = ~volcano, type = "surface")

shiny for interactive sites/apps

For the real interactive power and web-based apps, you want {shiny}. It is its own universe!

https://shiny.rstudio.com/
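To give a flavor of that universe, here is a minimal sketch of a Shiny app: a slider controlling a histogram of random data.

```r
library(shiny)

# UI: a slider input and a plot output
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: redraw the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n)))
}

shinyApp(ui, server)
```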

Packages with Python Connections

Using Python from R with reticulate

https://github.com/rstudio/reticulate

You may sometimes want to use Python tools/packages for a portion of your analysis, but still do the bulk of your analysis and visualization in R. Each has its strengths, and combining them is very handy!

The {reticulate} package allows you to run Python from within R. Quarto also permits this in a slightly more streamlined fashion. Either way, you do need to have Python installed and deal with the relevant Python environment management issues, which we can’t get into in this class.
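Assuming you have a working Python installation, a tiny sketch of the back-and-forth looks like this:

```r
library(reticulate)

# Run a bit of Python code in an embedded session
py_run_string("squares = [n ** 2 for n in range(5)]")

# Pull the Python object back into R via the `py` object
py$squares
```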

pangoling for word probabilities (in development)

The {pangoling} package by Bruno Nicenboim is in early development, but provides some easy tools to directly extract word probabilities from language models on HuggingFace, which is something that psycholinguists are often interested in doing.

https://bruno.nicenboim.me/pangoling/

It is not on CRAN, so you would need to install from GitHub using {remotes}.
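A sketch of that install (the GitHub repository path is an assumption — check the package site for the current one):

```r
# install.packages("remotes")
remotes::install_github("bnicenboim/pangoling")
```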

spacyR

You can use the Python spacy package within R using {reticulate} more easily through the {spacyr} package:

https://cran.r-project.org/web/packages/spacyr/vignettes/using_spacyr.html
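A minimal sketch, assuming spaCy and an English model are already installed on the Python side:

```r
library(spacyr)

# Finds and connects to a spaCy installation in Python
spacy_initialize()

# Returns a data frame of tokens with lemmas and POS tags
spacy_parse("Parsing with spaCy from R is convenient.")
```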

Transformers with R

The {text} package helps with using transformer language models from HuggingFace for NLP/ML. The package provides guidance for installing the Python packages needed to use it.

You could use this package for text generation, extracting word probabilities and embeddings, training word embeddings, computing semantic distances, and other tasks like classification.

For background on word embeddings, here’s a guide (conceptual, not R specific): https://jalammar.github.io/illustrated-word2vec/

This package is quite comprehensive, so takes a while to learn all of the capabilities.

https://r-text.org//index.html
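As one small example of what {text} can do, here is a sketch of extracting embeddings (the first call downloads model weights from HuggingFace, which can take a while, and the exact return structure depends on the package version):

```r
library(text)

# Embed a sentence with the package's default transformer model
emb <- textEmbed("The quick brown fox jumps over the lazy dog.")
```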

Rendering Quarto to Other Outputs

Poll

Have you ever used LaTeX for typesetting?

Inline R Code Reminder

You can use what is called “inline R code” in R Markdown and Quarto documents to include variables from the environment directly in your markdown text. For example, let’s look at the msleep dataset:

library(tidyverse)
msleep
#> # A tibble: 83 × 11
#>    name   genus vore  order conservation sleep_total sleep_rem sleep_cycle awake
#>    <chr>  <chr> <chr> <chr> <chr>              <dbl>     <dbl>       <dbl> <dbl>
#>  1 Cheet… Acin… carni Carn… lc                  12.1      NA        NA      11.9
#>  2 Owl m… Aotus omni  Prim… <NA>                17         1.8      NA       7  
#>  3 Mount… Aplo… herbi Rode… nt                  14.4       2.4      NA       9.6
#>  4 Great… Blar… omni  Sori… lc                  14.9       2.3       0.133   9.1
#>  5 Cow    Bos   herbi Arti… domesticated         4         0.7       0.667  20  
#>  6 Three… Brad… herbi Pilo… <NA>                14.4       2.2       0.767   9.6
#>  7 North… Call… carni Carn… vu                   8.7       1.4       0.383  15.3
#>  8 Vespe… Calo… <NA>  Rode… <NA>                 7        NA        NA      17  
#>  9 Dog    Canis carni Carn… domesticated        10.1       2.9       0.333  13.9
#> 10 Roe d… Capr… herbi Arti… lc                   3        NA        NA      21  
#> # ℹ 73 more rows
#> # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>

I can use inline R code to tell you in the text that there are 83 rows in the dataset. What I really wrote there (you can check in the page code!) is in backticks with an r at the beginning like `r `.

I can also use this for computations, so I could calculate the mean of body weights in the dataframe, which comes out to 166.1363494.
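Putting those two together, the markdown source for sentences like the ones above looks roughly like this (a sketch of the inline syntax):

```markdown
There are `r nrow(msleep)` rows in the dataset.

The mean body weight is `r mean(msleep$bodywt)` kg.
```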

And, I could use it to “print” statistical test output, like coefficient estimates and p-values. This is how you do simple linear models in R, by the way!

model <- lm(brainwt ~ bodywt, data = msleep)
summary(model)
#> 
#> Call:
#> lm(formula = brainwt ~ bodywt, data = msleep)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.78804 -0.08422 -0.07634 -0.02839  2.06190 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 8.592e-02  4.821e-02   1.782   0.0804 .  
#> bodywt      9.639e-04  5.027e-05  19.176   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3526 on 54 degrees of freedom
#>   (27 observations deleted due to missingness)
#> Multiple R-squared:  0.8719, Adjusted R-squared:  0.8696 
#> F-statistic: 367.7 on 1 and 54 DF,  p-value: < 2.2e-16

The estimate of the impact of body weight on brain weight is 0.00096 and the associated p-value is 9.2e-26. (One would ideally format these numbers better, which there are ways to do, but not today’s topic! Quick example is using format.pval() for the p-value to get <0.001.)
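For that p-value formatting, format.pval() collapses any value below a floor you choose into a “less than” string:

```r
# Any p-value smaller than eps is reported as "<eps"
format.pval(9.2e-26, eps = 0.001)
#> [1] "<0.001"
```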

These statistics are associated with the line you see in this plot (though I trimmed the scales/data using coord_cartesian() so you can see it better):

ggplot(msleep, aes(x = bodywt, y = brainwt)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(xlim = c(0, 600), ylim = c(0, 2))

You can see the Quarto documentation about this for more details/examples.

Other Quarto Output

We’ve been focusing on making webpages/html with Quarto, but the beauty of Quarto and markdown is that it can make all different kinds of output from the same or similar documents!

The main difference is setting the format in your YAML and rendering to that specific format. However, there are some quirks to customize per output, especially for LaTeX-based PDFs and for slides.

For example, to render a PDF (using LaTeX), your YAML would minimally look like:

---
title: "This is going to be a PDF!"
format: pdf
---

You can specify multiple formats, with options (or default) for each:

---
title: "This is going to be a PDF!"
format: 
  html:
    embed-resources: true
  docx: default
  gfm: default
  pdf: 
    documentclass: article
    cite-method: natbib
    indent: false
    papersize: letter
---

Try it out on a document yourself (start with only one option at a time). To use LaTeX for creating PDFs, you will need to install a TeX distribution. The simplest way to do this is using the Quarto command line in your terminal (not console) to install tinytex:

quarto install tinytex

Then create a template Quarto document, specify pdf format, and try to render!
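You can also render to a specific format from the terminal rather than the RStudio Render button (the file name here is a placeholder):

```shell
quarto render mydoc.qmd --to pdf
```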

Poll

Did it work to create a PDF?

  1. yes
  2. no

We’ll look at some examples with various output, but here is the documentation for the different formats to follow up with: