Class 10
Data Hygiene, Files, and Projects
Preparation Materials
- Data Import (R for Data Science 2e)
- Broman and Woo (2018)
Other Resources
Agenda
Today we’ll focus on:
- Quarto questions/issues?
- Data Collection Formats
- Data Dictionaries
- Working with files and projects (to import, analyze, and share data)
- Examining a “real” dataset
Data Collection
What would be the ISO-8601 standard format (recommended in the reading) for today’s date?
What would you change about the data collection for our survey, based on your experiences with it and/or the reading?
Please stop using color to indicate important information in your data!
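For the ISO 8601 question above, a minimal base R sketch shows one way to produce and parse dates in year-month-day order. The fixed date string is an arbitrary value chosen only for illustration.

```r
# Sys.Date() returns today's date as a Date object
today <- Sys.Date()

# Date objects print in year-month-day order by default;
# format() makes the ISO 8601 pattern explicit
iso_today <- format(today, "%Y-%m-%d")
iso_today

# An ISO 8601 string can be parsed back into a Date without a format argument
# (the date below is arbitrary)
example_date <- as.Date("2018-04-02")
example_date
```

Because the year comes first and every part is zero-padded, dates stored this way sort correctly even when they are treated as plain text.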
Data Dictionaries or Codebooks
When working with data from others or sharing your own, data dictionaries are very important to understanding what the data represent. Broman and Woo (2018) give some examples, and we will see a basic version in the data we import below.
You can create your own data dictionary in your analysis file in various ways within Quarto (or R Markdown). The typical way is to use a table, which you can either write directly in markdown or build as a dataframe and output from a code chunk.
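As a rough sketch of the dataframe approach, the example below builds a small data dictionary and prints it. The variable names, types, and descriptions are hypothetical and are not taken from any dataset used in class. The table is rendered with knitr::kable() if knitr is installed, and printed as a plain dataframe otherwise.

```r
# Hypothetical variables, invented for illustration only
data_dictionary <- data.frame(
  variable = c("participant_id", "rt_ms", "condition"),
  type = c("character", "numeric", "character"),
  description = c(
    "Anonymous participant identifier",
    "Reaction time in milliseconds",
    "Experimental condition label"
  )
)

# Render as a markdown table when knitr is available; otherwise print as-is
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(data_dictionary))
} else {
  print(data_dictionary)
}
```

Keeping the dictionary as a dataframe means it can also be saved next to the data, for example with write.csv().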
Markdown tables are a bit tedious, so this is a case when you might want to switch to the Visual mode in RStudio. You can then use the Table menu to insert a table and fill it out in a WYSIWYG (“What you see is what you get”) interface.
Datasets that are included in R packages generally include the data dictionary in the help file, in more of a list format. For example, try:
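One dataset that illustrates this is mtcars, which ships with base R. It is used here only as an assumed example and may not be the one shown in class.

```r
# mtcars comes with the datasets package, which is loaded by default
# Open its help page; the "Format" section acts as the data dictionary
?mtcars

# Compare the documented variables against the actual column names
names(mtcars)
```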
Working with someone else’s data
Download the full code and data for the following paper from the OSF Repository:
Husband, E. M. (2022). Prediction in the maze: Evidence for probabilistic pre-activation from the English a/an contrast. Glossa Psycholinguistics, 1(1). https://doi.org/10.5070/G601153
First, we’ll learn a bit about the paper.
Then, let’s work together on reading in the data and examining it ourselves.
Importing Files
First Attempt
Try to read in the data file using the downloaded code. What happens?
Creating a project
If you are importing files that are not already associated with a project, you can create that structure in your own new project.
Doing this, you may encounter more challenges with file paths. A lot more detail about paths can be found in the File paths chapter of the R for Epidemiology online book.
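Many path problems come from the fact that R resolves relative paths against the current working directory. The base R sketch below shows a few functions for checking this. It assumes a folder named data, which may not match how your own files are laid out.

```r
# Relative paths are resolved against the current working directory
getwd()

# Build a path from its parts instead of pasting strings together
# ("data" is an assumed folder name)
data_path <- file.path("data", "delong maze 40Ss.csv")
data_path

# FALSE here means the file is not at that location relative to getwd()
file.exists(data_path)

# Show the absolute path R would look for
# (mustWork = FALSE avoids an error if the file is missing)
normalizePath(data_path, mustWork = FALSE)
```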
R Projects with {here}
We’ll do an import the easy way, from an R project using the {here} package. The here package does a lot of work to simplify sharing files and projects with others, or using them across multiple computers. It also helps manage working directories across rendering and interactive mode.
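A short sketch of what here() returns, assuming the {here} package is installed and that the data file sits in a data folder at the top level of the project:

```r
library(here)

# With no arguments, here() returns the directory that {here} treats as the
# project root (established when the package is loaded)
here()

# Path components are appended to that root, giving an absolute path
data_path <- here("data", "delong maze 40Ss.csv")
data_path

# Check that the file is where the path says it is
file.exists(data_path)
```

Because the path is built from the project root rather than from the folder containing the Quarto file, the same line can work both when running code interactively and when rendering.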
We’ll work interactively on loading in the data from the .csv file, which contains reaction time data.
Create a new analysis Quarto file in the project, in the analysis folder.
- Set up the code to use here:
library(here)
- Use here() with a relative path to load the data:
df <- read.csv(here("data", "delong maze 40Ss.csv"),
               header = TRUE,
               sep = ",",
               comment.char = "#",
               strip.white = TRUE,
               col.names = c("Index","Time","Counter","Hash","Owner","Controller","Item","Element",
                             "Type","Group","FieldName","Value","WordNum","Word","Alt","WordOn",
                             "CorrWord","RT","Sent","TotalTime","Question","Resp","Acc","RespRT"))
Working with the Data
- Create a new dataframe with only the rows that have the value “Maze” for the Controller variable, and don’t have the word “practice” in Type.
- Take everything in Type in the remaining data, and use separate_wider_delim() to make separate columns from all of the variable names separated by periods (.).
- Create a new column to indicate in characters whether the real word was on the left or right, based on the WordOn variable, where 0 is left and 1 is right.
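The steps above can be sketched on a toy dataframe, assuming dplyr and tidyr (version 1.3.0 or later) are installed. The toy values, the three-part structure of Type, and the new column names (Expectancy, Article, Cloze, WordSide) are all invented for illustration. Check the real data and its codebook before choosing names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the imported data; all values are invented
toy <- data.frame(
  Controller = c("Maze", "Maze", "Question", "Maze"),
  Type = c("practice.a.b", "expected.an.high", "expected.an.high", "unexpected.a.low"),
  WordOn = c(0, 1, 0, 0)
)

maze <- toy |>
  # Keep Maze rows whose Type does not contain "practice"
  filter(Controller == "Maze", !grepl("practice", Type)) |>
  # Split Type at each period; delim is treated as a fixed string by default
  separate_wider_delim(Type, delim = ".",
                       names = c("Expectancy", "Article", "Cloze")) |>
  # Recode the 0/1 position into a character label
  mutate(WordSide = if_else(WordOn == 0, "left", "right"))

maze
```

If some rows of Type have a different number of pieces than the names supplied, separate_wider_delim() stops with an error by default. Its too_few and too_many arguments control how such rows are handled.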