Class 20

More GitHub Pages and Tidy Text

Materials for class on

2024-11-05

Further Reading

We’re following up on these readings/materials today:

Agenda

Today we’ll focus on:

  • GitHub pages continued
  • tidytext basics

GitHub Pages continued

Let’s go back to last class’s notes and resolve some outstanding challenges!

tidytext

Today we’ll start with the basics of what “tidy text” is, how to calculate and visualize tf-idf, and a bit about bigrams. We’ll continue next time with more about sentiment analysis.

What is tidy text?

There are many ways to analyze text, but here we are focusing on tidy text. Just as tidy data have one observation per row, tidy text has one token per row. A token is a linguistic unit that could be a "word" or something bigger or smaller, depending on the context.
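As a quick illustration (a minimal sketch, not from the tutorial; the tibble and example sentence are made up), unnest_tokens() takes a data frame with one row per line of text and returns one row per token:

library(tidyverse)
library(tidytext)

# one row per line of text (made-up example)
tiny_text <- tibble(line = 1, text = "The quick brown fox jumps over the lazy dog")

# one row per token (word), lowercased by default
tiny_text %>% 
  unnest_tokens(word, text)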

tf-idf

tf-idf stands for term frequency–inverse document frequency, and is intended to measure the "importance" of a term to a document by weighting how often the term occurs in that document against how many documents in the collection contain it. If a word has a high tf-idf for a document, it is distinctly characteristic of that document: it is very common in that specific document, but NOT as frequent across all documents.
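Concretely, tf is a term's count divided by the total number of words in the document, idf is the natural log of the number of documents divided by the number of documents containing the term, and tf-idf is their product. A small worked sketch (the counts are made up; bind_tf_idf() in tidytext computes the same quantities from a tidy count table):

# toy counts (made up): "rabbit" appears 10 times in a 1000-word document
# and occurs in 1 of the 3 documents in the collection
tf  <- 10 / 1000     # term frequency within the document
idf <- log(3 / 1)    # inverse document frequency (natural log)
tf * idf             # tf-idf, roughly 0.011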

n-grams

n-grams are consecutive sequences of tokens, which can be counted at different sizes of "n": unigrams are single tokens, bigrams are sequences of two tokens, trigrams are sequences of three, and so on.
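For example (a minimal sketch, not part of the tutorial; the sentence is made up), unnest_tokens() can produce bigrams directly by setting token = "ngrams":

library(tidyverse)
library(tidytext)

# tokenize into overlapping two-word sequences and count them
tibble(line = 1, text = "she sells sea shells by the sea shore") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  count(bigram, sort = TRUE)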

Example from tutorial

library(tidyverse)
library(gutenbergr)
library(tidytext)
fairytales_raw <- gutenberg_download(c(28885, 2591, 1597), 
                                     mirror = "http://mirrors.xmission.com/gutenberg/")
fairytales_raw <- fairytales_raw %>% 
  mutate(gutenberg_id = recode(gutenberg_id,
                               "28885" = "Alice's Adventures in Wonderland",
                                                "2591" = "Grimm's Fairytales",
                                                "1597" = "Hans Christian Anderson's Fairytales"),
         gutenberg_id = as.factor(gutenberg_id))

fairytales_tidy <- fairytales_raw %>% 
  unnest_tokens(word, text) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words) 
#> Joining with `by = join_by(word)`
fairytales_freq <- fairytales_tidy %>% 
  group_by(gutenberg_id) %>% #including this ensures that the counts are by book and the id column is retained
  count(word, sort=TRUE)

fairytales_idf <- fairytales_freq %>% 
  bind_tf_idf(word, gutenberg_id, n)

fairytales_idf %>%
  group_by(gutenberg_id) %>% 
  arrange(desc(tf_idf)) %>% 
  top_n(20, tf_idf) %>% 
  ggplot(aes(x = tf_idf, y = reorder(word, tf_idf), fill = gutenberg_id)) +
  geom_col(show.legend = FALSE) +
  labs(x = "tf-idf", y = NULL) +
  facet_wrap(~gutenberg_id, scales = "free") +
  theme_minimal()

Poll

How would we “unnest tokens” for languages with different writing systems? What challenges might there be?

Poll

Should stopwords be removed before or after a bigram analysis?

  1. before
  2. after

Demo: Chinese Tokenization

From https://smltar.com/tokenization.html#tokenization-for-non-latin-alphabets

library(jiebaR)
#> Loading required package: jiebaRD
words <- c("下面是不分行输出的结果", "下面是不输出的结果")

engine1 <- worker(bylines = TRUE)

segment(words, engine1)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"