Class 10

Data Hygiene, Files, and Projects

Materials for class on 2024-09-26

Preparation Materials

Data Import (R for Data Science 2e): https://r4ds.hadley.nz/data-import.html
Broman and Woo (2018)

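Other Resources

Husband 2022 Paper: https://escholarship.org/uc/item/7dz7z3q3
Husband 2022 OSF Repository: https://osf.io/frdtm/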

Agenda

Today we’ll focus on:

Quarto questions/issues?
Data Collection Formats
Data Dictionaries
Working with files and projects (to import, analyze, and share data)
Examining a “real” dataset

Data Collection

Poll

What would be the ISO-8601 standard format (recommended in the reading) for today’s date?
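
For reference once the poll is done, here is a minimal base R sketch of producing an ISO 8601 date. The date is the class date from the page header; the US-style input string is a made-up example, not something from the survey.

```r
# ISO 8601 writes dates as YYYY-MM-DD
class_date <- as.Date("2024-09-26")
iso_date <- format(class_date, "%Y-%m-%d")
print(iso_date)

# Hypothetical example: converting a US-style date string to ISO 8601
us_style <- "09/26/2024"
converted <- format(as.Date(us_style, format = "%m/%d/%Y"), "%Y-%m-%d")
print(converted)
```

Dates stored this way sort correctly as plain text, which is one reason the format is worth using.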

Discussion

What would you change about the data collection for our survey, based on your experiences with it and/or the reading?

Please stop using color to indicate important information in your data!

Data Dictionaries or Codebooks

When working with data from others or sharing your own, data dictionaries are very important to understanding what the data represent. Broman and Woo (2018) give some examples, and we will see a basic version in the data we import below.

You can create your own data dictionary in your analysis file in various ways within Quarto (or RMarkdown). The typical way is to use a table. You can write a table directly with markdown formatting (https://quarto.org/docs/authoring/tables.html), or you can create a dataframe and print it as a table.

Markdown tables are a bit tedious, so this is a case when you might want to switch to the Visual mode in RStudio. You can then use the Table menu to insert a table and fill it out in a WYSIWYG (“What you see is what you get”) interface.
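
As a rough sketch of the dataframe route, the block below builds a small data dictionary and prints it as a table. The column names come from the import code later in these notes, but the types and descriptions are illustrative guesses, not the paper's actual codebook. Using `knitr::kable()` is one option among several and is only called if knitr is installed.

```r
# Hypothetical data dictionary for three columns of the maze data.
# Types and descriptions are illustrative guesses; only the WordOn coding
# (0 = left, 1 = right) is taken from these notes.
data_dictionary <- data.frame(
  variable    = c("Word", "RT", "WordOn"),
  type        = c("character", "numeric", "integer"),
  description = c(
    "Word shown to the participant",
    "Reaction time for the word",
    "Side the real word appeared on (0 = left, 1 = right)"
  )
)

# Show a formatted table when knitr is available; otherwise print plainly
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(data_dictionary)
} else {
  data_dictionary
}
```

Keeping the dictionary as a dataframe means it can be saved alongside the data (for example with `write.csv()`) as well as shown in the rendered document.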

Datasets that are included in R packages generally include the data dictionary in the help file, in more of a list format. For example, try:

?mtcars
?trees
?swiss
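
To compare a help-file dictionary against the data itself, one option is to inspect the dataset's structure next to it. This is a small sketch using base R only:

```r
# Column names to match against the list in the help file
print(names(mtcars))

# Type and first few values of each column
str(mtcars)

# Number of rows and columns
print(dim(mtcars))
```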

Working with someone else’s data

Download the full code and data for the following paper from the OSF Repository (https://osf.io/frdtm/):

Husband, E. M. (2022). Prediction in the maze: Evidence for probabilistic pre-activation from the English a/an contrast. Glossa Psycholinguistics, 1(1). https://doi.org/10.5070/G601153

First, we’ll learn a bit about the paper.

Then, let’s work together on reading in the data and examining it ourselves.

Importing Files

First Attempt

Try to read in the data file using the downloaded code. What happens?

Creating a project

If you are importing files that are not already associated with a project, you can create that structure in your own new project.

When you do this, you may run into more challenges with file paths. The File paths chapter (https://www.r4epi.com/file-paths.html) of the R for Epidemiology online book covers paths in much more detail.
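
A quick way to see why paths cause trouble is to check where R resolves relative paths from. The sketch below uses base R only; the `data` folder and file name are the ones used later in these notes, and whether the file exists depends on your own setup.

```r
# Where R currently resolves relative paths from
print(getwd())

# A relative path, built in a platform-independent way
rel_path <- file.path("data", "delong maze 40Ss.csv")
print(rel_path)

# The absolute location that the relative path currently points to
abs_path <- file.path(getwd(), rel_path)
print(abs_path)

# FALSE means the working directory is not the folder you expected,
# or the file is not in data/
print(file.exists(rel_path))
```

If the last line prints `FALSE`, the file and the working directory are out of step, which is the problem that projects are meant to prevent.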

R Projects with `{here}`

We’ll import the data the easy way, from an R project using the {here} package. The here package (https://here.r-lib.org/) does a lot of work to simplify sharing files and projects with others, or using them across multiple computers. It also helps manage working directories consistently between rendering and interactive mode.

We’ll work interactively on loading in the data from the .csv file, which contains reaction time data.

Create a new analysis Quarto file in the project, in the analysis folder.

Set up the code to use here:

# provide 'directions' to the current file from the root of the project
here::i_am("path/to/thisfile.R")
library(here)

Use here() with a relative path to load the data:

# Read the data file; text after "#" in the file is treated as a comment
df <- read.csv(here("data", "delong maze 40Ss.csv"),
               header = TRUE,
               sep = ",",
               comment.char = "#",
               strip.white = TRUE,
               col.names = c("Index", "Time", "Counter", "Hash", "Owner", "Controller",
                             "Item", "Element", "Type", "Group", "FieldName", "Value",
                             "WordNum", "Word", "Alt", "WordOn", "CorrWord", "RT",
                             "Sent", "TotalTime", "Question", "Resp", "Acc", "RespRT"))
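
If the import fails, it can help to check where `here()` is pointing before reading anything. The following is a small sketch, assuming the {here} package is installed; the folder and file name match the import code above.

```r
library(here)

# here() with no arguments reports the project root it detected
print(here())

# Build the path to the data file from that root
data_path <- here("data", "delong maze 40Ss.csv")
print(data_path)

# Report a readable message if the file is not where we expect
if (!file.exists(data_path)) {
  message("File not found at: ", data_path)
}
```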

Working with the Data

Create a new dataframe with only the rows that have the value “Maze” for the Controller variable, and don’t have the word “practice” in Type.
Take everything in Type in the remaining data, and use separate_wider_delim() to make separate columns from all of the variable names separated by periods (.).
Create a new column to indicate in characters whether the real word was on the left or right, based on the WordOn variable where 0 is left, 1 is right.
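
One possible way to approach these three steps is sketched below on a toy dataframe, assuming dplyr and tidyr (version 1.3.0 or later for `separate_wider_delim()`) are installed. The toy `Type` values, their three-part structure, and the new column names (`experiment`, `item`, `condition`, `word_side`) are invented for illustration. Check the real file to see how many pieces `Type` actually contains.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the imported data; the values are made up
toy <- data.frame(
  Controller = c("Maze", "Maze", "Question", "Maze"),
  Type       = c("practice.p1.none", "exp.item1.expected",
                 "exp.item1.expected", "exp.item2.unexpected"),
  WordOn     = c(0, 1, 0, 0),
  RT         = c(512, 430, NA, 611)
)

maze <- toy |>
  # Keep Maze rows whose Type does not contain "practice"
  filter(Controller == "Maze", !grepl("practice", Type, fixed = TRUE)) |>
  # Split Type at each period into separate columns
  separate_wider_delim(Type, delim = ".",
                       names = c("experiment", "item", "condition")) |>
  # Recode WordOn (0 = left, 1 = right) as a character column
  mutate(word_side = if_else(WordOn == 0, "left", "right"))

print(maze)
```

With the real data, the `names` argument needs one name per period-separated piece of `Type`. If the number of pieces varies between rows, the `too_few` and `too_many` arguments of `separate_wider_delim()` control what happens.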

