Class 20

More GitHub Pages and Tidy Text

Materials for class on

2024-11-05

Further Reading

We’re following up on these readings/materials today:

Agenda

Today we’ll focus on:

  • GitHub pages continued
  • tidytext basics

GitHub Pages continued

Let’s go back to last class’s notes and resolve some outstanding challenges!

tidytext

Today we’ll start with the basics of what “tidy text” is, how to calculate and visualize tf-idf, and a bit about bigrams. We’ll continue next time with more about sentiment analysis.

What is tidy text?

There are many ways to analyze text, but here we are focusing on tidy text. Just as tidy data have one observation per row, tidy text has one token per row. A token is a linguistic unit that could be a "word" or something bigger or smaller, depending on the context.
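As a quick illustration (a minimal sketch, not from the tutorial; the tibble and example sentence are made up), unnest_tokens() takes a data frame with one row per line of text and returns one row per token:

library(tidyverse)
library(tidytext)

# one row per line of text (made-up example)
tiny_text <- tibble(line = 1, text = "The quick brown fox jumps over the lazy dog")

# one row per token (word), lowercased by default
tiny_text %>% 
  unnest_tokens(word, text)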

tf-idf

tf-idf stands for term frequency–inverse document frequency, and is intended to measure the "importance" of a term to a document by weighting how often the term occurs in that document against how many documents in the collection contain it. If a word has a high tf-idf for a document, it is distinctly characteristic of that document: it is very common in that specific document, but NOT as frequent across all documents.
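Concretely, tf is a term's count divided by the total number of words in the document, idf is the natural log of the number of documents divided by the number of documents containing the term, and tf-idf is their product. A small worked sketch (the counts are made up; bind_tf_idf() in tidytext computes the same quantities from a tidy count table):

# toy counts (made up): "rabbit" appears 10 times in a 1000-word document
# and occurs in 1 of the 3 documents in the collection
tf  <- 10 / 1000     # term frequency within the document
idf <- log(3 / 1)    # inverse document frequency (natural log)
tf * idf             # tf-idf, roughly 0.011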

n-grams

n-grams are consecutive sequences of tokens, which can be counted at different sizes of "n": unigrams are single tokens, bigrams are sequences of two tokens, trigrams are sequences of three, and so on.
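For example (a minimal sketch, not part of the tutorial; the sentence is made up), unnest_tokens() can produce bigrams directly by setting token = "ngrams":

library(tidyverse)
library(tidytext)

# tokenize into overlapping two-word sequences and count them
tibble(line = 1, text = "she sells sea shells by the sea shore") %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  count(bigram, sort = TRUE)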

Example from tutorial

library(tidyverse)
library(gutenbergr)
library(tidytext)
fairytales_raw <- gutenberg_download(c(28885, 2591, 1597), 
                                     mirror = "http://mirrors.xmission.com/gutenberg/")
fairytales_raw <- fairytales_raw %>% 
  mutate(gutenberg_id = recode(gutenberg_id,
                               "28885" = "Alice's Adventures in Wonderland",
                                                "2591" = "Grimm's Fairytales",
                                                "1597" = "Hans Christian Anderson's Fairytales"),
         gutenberg_id = as.factor(gutenberg_id))

fairytales_tidy <- fairytales_raw %>% 
  unnest_tokens(word, text) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  anti_join(stop_words) 
#> Joining with `by = join_by(word)`
fairytales_freq <- fairytales_tidy %>% 
  group_by(gutenberg_id) %>% #including this ensures that the counts are by book and the id column is retained
  count(word, sort=TRUE)

fairytales_idf <- fairytales_freq %>% 
  bind_tf_idf(word, gutenberg_id, n)

fairytales_idf %>%
  group_by(gutenberg_id) %>% 
  arrange(desc(tf_idf)) %>% 
  top_n(20, tf_idf) %>% 
  ggplot(aes(x = tf_idf, y = reorder(word, tf_idf), fill = gutenberg_id)) +
  geom_col(show.legend = FALSE) +
  labs(x = "tf-idf", y = NULL) +
  facet_wrap(~gutenberg_id, scales = "free") +
  theme_minimal()

Poll

How would we “unnest tokens” for languages with different writing systems? What challenges might there be?

Poll

Should stopwords be removed before or after a bigram analysis?

  1. before
  2. after

Demo: Chinese Tokenization

From https://smltar.com/tokenization.html#tokenization-for-non-latin-alphabets

library(jiebaR)
#> Loading required package: jiebaRD
words <- c("下面是不分行输出的结果", "下面是不输出的结果")

engine1 <- worker(bylines = TRUE)

segment(words, engine1)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"