Class 14

Exploratory Data Analysis (EDA)

Materials for class on 2024-10-10

Agenda

Today we’ll focus on:

- Homework 2 questions/work time
- exploratory data analysis vs. p-hacking
- EDA visualization

Homework 2

For Homework 2, you are putting together several of the concepts we have covered into one analysis report. Working on this should be some good practice for taking the leap into your first project, so make sure to ask questions about parts you are struggling with!

Mapping with coord_quickmap()

After resolving your ZIP codes to latitude/longitude values, you can map them with coord_quickmap() from ggplot2. Map longitude to the x aesthetic and latitude to the y aesthetic. Here’s an example from the ggplot2 book (https://ggplot2-book.org/maps.html#sec-polygonmaps):

library(tidyverse)
library(maps) # you will need to install this package to run example

# County outlines for Michigan, with columns renamed for readability
mi_counties <- map_data("county", "michigan") |>
  select(lon = long, lat, group, id = subregion)

# Plot the outline vertices as points
ggplot(mi_counties, aes(lon, lat)) +
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()

# Connect the vertices into one polygon per county
ggplot(mi_counties, aes(lon, lat, group = group)) +
  geom_polygon(fill = "white", colour = "grey50") +
  coord_quickmap()

We can place a point on the map for a specific ZIP code by looking up its coordinates and adding them as another layer:

library(zipcodeR)
# ZIP codes as strings, so any leading zeros are preserved
myzips <- geocode_zip(c("48109", "48220", "48808"))

ggplot(mi_counties) +
  geom_polygon(aes(lon, lat, group = group), fill = "white", colour = "grey50") +
  # the looked-up coordinates are in the lng and lat columns of myzips
  geom_point(data = myzips, aes(lng, lat), color = "blue", shape = 18) +
  coord_quickmap()
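One snag that may come up before the lookup: if ZIP codes were read in as numbers, any leading zero is dropped (02138 becomes 2138), and the lookup may then fail to match. The sketch below shows one way to restore the five-character form in base R. The vector `raw_zips` is a hypothetical stand-in for a column in your own data.

```r
# Hypothetical ZIP codes as they might look after being read in as numbers
raw_zips <- c(48109, 2138, 8544)

# Pad each value with leading zeros to a width of five characters
clean_zips <- sprintf("%05d", as.integer(raw_zips))

clean_zips
```

The padded character vector can then be passed to the lookup step in place of the numeric values.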

What is Exploratory Data Analysis?

Poll

Have you heard of “p-hacking”?

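One of the best ways to understand p-hacking is the xkcd webcomic “Significant” (https://xkcd.com/882/), in which researchers test one jelly bean color after another until a single test comes up significant, producing shocking news about green jelly beans.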

When analyzing data for hypothesis testing, confirmatory research, or inferential statistics, there are certain constraints that must be followed to ensure the proper interpretation of the tests. One of these is limiting the number of tests or comparisons to a clearly defined set, and reporting all results of such tests.

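The reason for this constraint is that statistical tests are probability-based: every test carries some chance of a false positive, so the more unreported tests you run, the more likely you are to observe “spurious” results. For example, at a significance level of 0.05, 20 independent tests of effects that do not exist will produce at least one “significant” result about 64% of the time (1 - 0.95^20 ≈ 0.64).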
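To make the multiple-testing problem concrete, here is a minimal simulation sketch in base R. It assumes a setup in the spirit of the jelly bean comic: each “study” runs 20 t-tests on pure noise, so any significant result is spurious. The function name, group size, and number of simulated studies are illustrative choices, not part of the course materials.

```r
set.seed(882)

# One "study": run n_tests t-tests in which both groups are drawn from the
# same distribution, so every significant result is a false positive.
any_false_positive <- function(n_tests = 20, n_per_group = 30, alpha = 0.05) {
  p_values <- replicate(
    n_tests,
    t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value
  )
  any(p_values < alpha)
}

# Share of simulated studies with at least one spurious "finding"
n_studies <- 1000
share_with_false_positive <- mean(replicate(n_studies, any_false_positive()))

# Expected share if the 20 tests are independent
theoretical <- 1 - (1 - 0.05)^20

share_with_false_positive
theoretical
```

The simulated share should land close to the theoretical value, which is why reporting only the one test that “worked” is misleading.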

Things like “p-values” are designed for hypothesis testing. If you want to make claims about the generalizability of the data, or predictions for other data, you need to learn more about inferential statistics (which we aren’t doing here).

What we are working on here is exploring the data we have. In research practice, this might be done to generate new hypotheses, or on a sample dataset to inform how another dataset will be tested. Exploratory work on a test dataset might also be reported as exploratory or post-hoc analysis. That is fine as long as it is not presented as confirmatory hypothesis testing.

Pencil and Paper EDA Planning

The process of EDA doesn’t start in your code - it starts in the ideas and questions you have about your data.

It’s best to start thinking it through with pen(cil) and paper, a whiteboard, or a tablet drawing app - something very easy to sketch ideas on.

Questions to consider:

- What do I want to check for in my data, to confirm that it is as I expect?
- What do I want to learn from the data?
- What kind of variables are relevant to this question?
- What variables do I want to see together in one plot?
- How can I distinguish between multiple variable layers/dimensions/scales?
- What summaries do I need to create to answer the question?
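Once the questions are sketched out, the first few can be turned into quick checks in code. The following is a rough sketch in base R, using the built-in `airquality` dataset as a stand-in for your own data. The specific checks and the temperature question are assumptions chosen for illustration.

```r
# airquality ships with base R; it stands in for your own data here
data(airquality)

# Is the data as I expect? Look at dimensions, types, and plausible ranges.
str(airquality)
summary(airquality)

# How many values are missing in each column?
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# A summary for the question "how does temperature change by month?"
temp_by_month <- aggregate(Temp ~ Month, data = airquality, FUN = mean)
temp_by_month

# Two variables together in one plot: temperature vs. ozone
plot(airquality$Temp, airquality$Ozone,
     xlab = "Temperature (F)", ylab = "Ozone (ppb)")
```

The same checks translate directly to tidyverse functions; base R is used here only so the sketch runs without extra packages.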

Types of Variables

An important consideration in exploring your variables is what type of value they have. Here are some terms that are used to describe variables (these are not specific to R):

- continuous: variables that can take on any value between other existing values (can be fractional)
- discrete/categorical: variables that can only take certain values, such as integers or choices/types
  - nominal: non-continuous variables with no inherent order
  - ordinal: non-continuous variables that have an inherent order, but may be unevenly spaced
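Although these terms are not specific to R, it can help to see how they might be represented there. The sketch below uses a small data frame of hypothetical survey responses, invented purely for illustration. It stores the nominal variable as a factor and the ordinal variable as an ordered factor.

```r
# Hypothetical survey responses, invented for illustration only
responses <- data.frame(
  hours_slept  = c(7.5, 6, 8.25, 5.5),                       # continuous
  n_pets       = c(0L, 2L, 1L, 0L),                          # discrete (counts)
  favorite_pet = c("cat", "dog", "dog", "fish"),             # nominal
  agreement    = c("agree", "neutral", "disagree", "agree")  # ordinal
)

# Nominal: a plain factor, whose levels carry no order
responses$favorite_pet <- factor(responses$favorite_pet)

# Ordinal: an ordered factor, with the levels listed from low to high
responses$agreement <- factor(
  responses$agreement,
  levels = c("disagree", "neutral", "agree"),
  ordered = TRUE
)

# Comparisons are defined for ordered factors, but not for plain factors
below_agree <- responses$agreement < "agree"
below_agree

# Counts are reported in level order rather than alphabetical order
table(responses$agreement)
```

Declaring the order explicitly matters for plotting as well, since axes and legends follow the level order rather than the alphabet.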

Poll

What would be an example of an ordinal variable from our survey data?

Poll

What would be an example of a nominal variable from our survey data?

Reviewing Examples

We’ll first look at some of the examples in the reading together (links below).

Further Reading

We will be working on this material today:

- Exploratory Data Analysis (R for Data Science 2e): https://r4ds.hadley.nz/eda.html
- PsyTeachR Reproducible Research - ggplot: https://psyteachr.github.io/reprores-v3/ggplot.html
- Not All Flexibility P-Hacking Is, Young Padawan (Daniel Lakens): http://daniellakens.blogspot.com/2021/10/not-all-flexibility-p-hacking-is-young.html