5 Data exploration

Whenever you have “spreadsheetey” data, your default data structure in R should be the data frame. Data frames are awesome because

They neatly package related variables by maintaining a spreadsheet-like row-ordering. Data frames make it easy to filter rows and columns of interest.
Most functions for inference, modelling, and graphing will happily take a data frame object.
The set of packages known as the tidyverse takes data frames one step further and explicitly prioritizes the processing of data frames.

Recall that data frames, unlike vectors or matrices in R, can hold different variable types. For example, data frames can simultaneously hold character data (e.g., subject ID or name), quantitative data (e.g., white blood cell count), and categorical information (e.g., treated vs. untreated).
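As a small illustration of mixed column types, here is a sketch that builds a toy data frame. The object name patients, the column names, and all values are made up for this example and are not part of gapminder.

```r
# Toy data: all names and values here are hypothetical, for illustration only
patients <- data.frame(
  subject_id = c("S01", "S02", "S03"),                        # character
  wbc_count  = c(5.2, 7.9, 6.4),                              # numeric (double)
  treatment  = factor(c("treated", "untreated", "treated")),  # factor
  stringsAsFactors = FALSE
)

# Each column keeps its own type
sapply(patients, class)
```

A matrix built from the same values would have to coerce everything to a single type (here, character).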

If you use data structures that only hold one type of data for data analysis, you might make the terrible mistake of spreading your data over multiple, unlinked objects. Why is this a mistake? Because you need to relate the row order in each object to every other object, i.e., a nightmare.
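To make the row-order problem concrete, here is a minimal sketch with two unlinked vectors. The names subject, age, and people, and their values, are invented for this example.

```r
# Two unlinked vectors (toy values): they line up only by position
subject <- c("Ana", "Bo", "Cy")
age     <- c(34, 21, 58)

# Sorting one vector silently breaks the pairing
age_sorted <- sort(age)
data.frame(subject, age_sorted)   # Ana is now paired with 21, which is wrong

# In a data frame, reordering rows moves every column together
people <- data.frame(subject, age)
people[order(people$age), ]       # Bo 21, Ana 34, Cy 58
```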

5.1 Get the gapminder data

We will work with some of the data from the Gapminder project. The Gapminder project contains the gapminder dataset, which summarises the progression of countries over time for statistics like life expectancy and GDP.

If you haven’t installed gapminder or the tidyverse yet, you can do so like this:

install.packages("gapminder", dependencies = TRUE)
install.packages("tidyverse", dependencies = TRUE)

Now load the two packages.

library(gapminder)
library(tidyverse)

5.2 Explore gapminder

By loading the gapminder package, we now have access to a data frame by the same name.

class(gapminder)

## [1] "tbl_df"     "tbl"        "data.frame"

Notice that the class (type of data structure) of the gapminder object is a tibble, the tidyverse’s version of R’s data frame. A tibble is also a data frame.

Let’s check out the contents of gapminder:

gapminder

## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # … with 1,694 more rows

Although this seems like a lot of output, notice that tibbles provide a nice print method that shows the most important stuff and doesn’t fill up your console. Let’s make sense of the output:

The first line refers to what we’re printing—a tibble with 1704 rows and 6 columns.
Below each column heading, we see <fct> <fct> <int> <dbl> <int> <dbl>. These refer to the variable type of that column.
1. fct is short for “factor” (kind of like a categorical variable),
2. int is short for “integer”, and
3. dbl is short for “double” (a number with decimal places).

If you’re only interested in a summary of your data frame, use str(), head() or tail():

str() will provide a sensible description of almost anything and, worst case, nothing bad can actually happen. When in doubt, just use str() on your recently created objects to get ideas about what to do next.
head() displays the first 6 rows of your data frame by default, and
tail() shows the last 6 rows.
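As a quick sketch of these three functions applied to gapminder (this assumes the gapminder package is installed):

```r
library(gapminder)

str(gapminder)       # compact overview: dimensions, column names, and types
head(gapminder)      # first 6 rows
tail(gapminder, 3)   # last 3 rows; the second argument sets the row count
```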

Play around with these functions on gapminder in your own R console!

Just for your reference, if you want to change a data frame into a tibble for nicer printing, use as_tibble()!

as_tibble(my_data_frame)  # my_data_frame is the thing we want to make a tibble

Here are more ways to query basic info on a data frame:

Function	Description
`names()`	returns column names
`ncol()`	returns number of columns
`nrow()`	returns number of rows
`dim()`	returns # of rows by # of columns
`summary()`	returns a statistical summary of each column
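A brief sketch of the functions in the table, run on gapminder (assuming the gapminder package is installed). The dimensions in the comments match the tibble header printed earlier.

```r
library(gapminder)

names(gapminder)     # column names
ncol(gapminder)      # 6
nrow(gapminder)      # 1704
dim(gapminder)       # 1704 6
summary(gapminder)   # per-column summary; its form depends on each column's type
```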

Try playing around with these functions in your R console.

5.2.1 Importing and exporting data

We can export data frames to a comma-separated values (.csv) file.

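# row.names = FALSE stops write.csv() from adding an extra column of row numbers
write.csv(gapminder, file = "data/03_data-frames/gapminder.csv", row.names = FALSE)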

Comma-separated values files are the preferred way of importing and exporting data because they contain no formatting. Other common formats include tab-separated values (.tsv) and Excel files (.xls or .xlsx).

In addition to writing to a .csv file, we can also read .csv files into R. It’s as simple as read.csv()!

gapminder2 <- read.csv("data/03_data-frames/gapminder.csv", header = TRUE)
class(gapminder2)

## [1] "data.frame"

As you can see, read.csv() returns a data frame object by default. Notice that we specify header = TRUE because the first row in the .csv file is a header. Also notice that we specified a file path to our .csv file.
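Here is a hedged sketch of a complete round trip. It uses a temporary file and a toy data frame (df, with made-up values) so that it runs without the data/03_data-frames/ folder existing. It also passes row.names = FALSE, because write.csv() otherwise writes the row names as an extra column.

```r
# Toy data frame (hypothetical values) so the example runs anywhere
df <- data.frame(id = 1:3, group = c("a", "b", "a"))

# Write to a temporary file; row.names = FALSE avoids an extra index column
path <- tempfile(fileext = ".csv")
write.csv(df, file = path, row.names = FALSE)

# Read it back; header = TRUE is already the default for read.csv()
df2 <- read.csv(path, header = TRUE)

names(df2)   # "id" "group"
class(df2)   # "data.frame"
```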

5.2.2 Exploring variables in a data frame

To specify a single variable from a data frame, use the dollar sign $. For example, gapminder$lifeExp refers to gapminder’s lifeExp column.
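One possible way to explore the lifeExp column with $ is sketched below (assuming the gapminder package is installed). The particular functions chosen here are just suggestions.

```r
library(gapminder)

head(gapminder$lifeExp)      # first few values of the column
class(gapminder$lifeExp)     # "numeric"
summary(gapminder$lifeExp)   # quartiles, mean, minimum, and maximum
range(gapminder$lifeExp)     # smallest and largest value
```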

Let’s continue to explore gapminder. Take a look at the year variable’s class:

class(gapminder$year)

## [1] "integer"

Notice that year holds integers. On the other hand, continent holds categorical information, which is called a factor in R.

class(gapminder$continent)

## [1] "factor"

Now, I want to illustrate something important:

summary(gapminder$year)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1952    1966    1980    1980    1993    2007

summary(gapminder$continent)

##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

Notice that the same function returned different outputs for different variable types—forgetting this observation can lead to confusion in the future, so make sure to check your data before analysis! Let’s check out a couple more useful functions and highlight important ideas in the meantime.

Within a given column/variable,

table() returns the number of observations for each unique value,
levels() returns the unique values (levels) of a factor, and
nlevels() returns the number of levels.

table(gapminder$continent)

## 
##   Africa Americas     Asia   Europe  Oceania 
##      624      300      396      360       24

levels(gapminder$continent)

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

nlevels(gapminder$continent)

## [1] 5

The levels of the factor continent are “Africa”, “Americas”, etc.—this is what’s usually presented to your eyeballs by R. Behind the scenes, R assigns integer values (i.e., 1, 2, 3, …) to each level. Never ever ever forget this fact. Look at the result from str(gapminder$continent) if you are skeptical:

str(gapminder$continent)

##  Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...

Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit. Note that factors are NOT plain integers, and they are not character strings either. A factor stores integer codes together with a set of labels (the levels), which is how R represents categorical data.
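A short sketch of the integer codes behind a factor, and of the classic mistake they invite. The vectors continent and years below are toy examples, not the gapminder columns.

```r
# A small factor (toy values)
continent <- factor(c("Asia", "Africa", "Asia", "Europe"))

levels(continent)         # "Africa" "Asia" "Europe" (sorted alphabetically)
as.integer(continent)     # 2 1 2 3: the underlying integer codes
as.character(continent)   # the labels you normally see

# Classic pitfall: a factor whose labels look like numbers
years <- factor(c("2007", "1952", "2007"))
as.integer(years)                 # 2 1 2: codes, not years!
as.integer(as.character(years))   # 2007 1952 2007
```

Converting through as.character() first is the usual way to recover the numbers that the labels spell out.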

Tl;dr, factors are categorical variables whereas levels are unique values within a factor.

Data frame summary. Use data frames and the tidyverse! The tidyverse provides a special type of data frame called a “tibble” that has nice default printing behavior, among other benefits.

When in doubt, str() something or print something.
Understand what your variable types are.
Use factors! (but with intention and care)
Do basic statistical and visual sanity checking of each variable.
Refer to variables by name (e.g., gapminder$lifeExp) and NOT by column number. Your code will be more robust and readable.

5.3 Data frames with dplyr

dplyr is a package for data manipulation. It is built to be fast, highly expressive, and open-minded about how your data is stored. It is installed as part of the tidyverse meta-package and it is among the packages loaded via library(tidyverse).

Here’s a bit of fun trivia: dplyr stands for “data frame pliers”.

5.3.1 Subsetting data

If you feel the urge to store a little snippet of your data:

canada <- gapminder[241:252, ]

Stop and ask yourself, “Do I want to create a separate subset of my original data?”

If “YES,” use proper data wrangling techniques. Alternatively, only subset the data as a temporary measure while you develop your elegant code.

If “NO,” then don’t subset!

Copies and excerpts of your data clutter your workspace, invite mistakes, and sow general confusion. Avoid whenever possible. Reality can also lie somewhere in between. You will find the workflows presented below can help you accomplish your goals with minimal creation of temporary, intermediate objects.

Recall the rm() function, which removes unwanted variable(s).

x <- 'thing to not keep'
print(x)
rm(x)
# print(x)  # gives an error because x is deleted

5.3.2 Filter rows with `filter()`

filter() takes logical expressions and returns the rows for which all are TRUE. Use this function when you want to subset observations based on row values.

The first argument is the name of the data frame. The subsequent arguments are the expressions that filter the data frame. For example, let’s filter all rows from gapminder where life expectancy is less than 29 years.

filter(gapminder, lifeExp < 29)

## # A tibble: 2 × 6
##   country     continent  year lifeExp     pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8 8425333      779.
## 2 Rwanda      Africa     1992    23.6 7290203      737.

When you run this line of code, dplyr filters the data and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you need to use the assignment operator, <-. Let’s try this out! Here we filter based on country and year:

rwanda_gthan_1979 <- filter(gapminder, country == "Rwanda", year > 1979)
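The sketch below checks that filter() leaves its input alone, and that conditions separated by commas are combined with “and”. It assumes the gapminder and dplyr packages are installed; rows_before is a helper name introduced only for this example.

```r
library(gapminder)
library(dplyr)

rows_before <- nrow(gapminder)

# Without assignment, the filtered result is printed and then discarded
filter(gapminder, country == "Rwanda", year > 1979)

# With assignment, the result is kept under a new name
rwanda_gthan_1979 <- filter(gapminder, country == "Rwanda", year > 1979)

nrow(gapminder) == rows_before   # TRUE: the original is untouched
nrow(rwanda_gthan_1979)          # 6 rows: 1982, 1987, ..., 2007
```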

Compare with some base R code to accomplish the same things:

gapminder[gapminder$lifeExp < 29, ]     # indexing is distracting
subset(gapminder, country == "Rwanda")  # almost same as filter; quite nice actually

What if you want to filter rows based on multiple values in a variable? For example, what if we want to filter all rows with either Rwanda or Afghanistan as countries?

filter(gapminder, country == "Rwanda" | country == "Afghanistan")

Recall that the Boolean operator, |, means “or”.

What if we want to keep more than just 2 countries? One way would be to string Boolean operators together like so: country == "Canada" | country == "Rwanda" | country == "Afghanistan" | ... This, however, is very wordy. A useful shortcut is to use x %in% y. This selects every row where x is one of the values in y:

filter(gapminder, country %in% c("Rwanda", "Afghanistan"))
filter(gapminder, country %in% c("Canada", "Rwanda", "Afghanistan"))
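To see what %in% does on its own, and that it gives the same rows as the chained | version, consider this sketch (assuming gapminder and dplyr are installed; with_or and with_in are names made up for the comparison):

```r
library(gapminder)
library(dplyr)

# %in% returns one TRUE/FALSE per element of the left-hand side
c("a", "b", "c") %in% c("a", "c")   # TRUE FALSE TRUE

with_or <- filter(gapminder, country == "Rwanda" | country == "Afghanistan")
with_in <- filter(gapminder, country %in% c("Rwanda", "Afghanistan"))

identical(with_or$country, with_in$country)   # TRUE: same rows, same order
nrow(with_in)                                 # 24: 12 years for each country
```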

Under no circumstances should you subset your data the way I did at first:

excerpt <- gapminder[241:252, ]

Why is this a terrible idea?

It is not self-documenting. What is so special about rows 241 through 252?
It is fragile. This line of code will produce different results if someone changes the row order of gapminder, e.g. sorts the data earlier in the script.

filter(gapminder, country == "Canada")

The above call explains itself and is fairly robust.
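The fragility can be demonstrated directly. In this sketch, sorted is a hypothetical re-ordered copy of gapminder, standing in for a sort that happens earlier in a script (gapminder and dplyr are assumed to be installed).

```r
library(gapminder)
library(dplyr)

# Suppose the data gets sorted by year earlier in the script
sorted <- arrange(gapminder, year)

# Row positions now point at different observations
unique(as.character(gapminder[241:252, ]$country))   # "Canada"
unique(as.character(sorted[241:252, ]$country))      # 12 different countries

# Filtering by value finds the same 12 Canadian rows either way
nrow(filter(gapminder, country == "Canada"))   # 12
nrow(filter(sorted, country == "Canada"))      # 12
```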

5.3.3 Pipe operator `%>%`

Before we go any further, we should exploit the new pipe operator that the tidyverse imports from the magrittr package by Stefan Bache. Here’s what it looks like: %>%. The RStudio keyboard shortcut: Ctrl + Shift + M (Windows), Cmd + Shift + M (Mac).

Let’s demo then I’ll explain:

gapminder %>% head()

## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

The above code is equivalent to head(gapminder). The pipe operator takes the thing on the left-hand side and pipes it into the function call on the right-hand side. It literally drops it in as the first argument. You can think of an argument as your input to a function. If you remember your grade school math, functions in R do exactly what you learned in school: they take inputs (arguments/parameters) and return an output, or return value.
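A minimal sketch that checks this equivalence, and then chains two calls on a plain vector (assuming gapminder and dplyr are installed; dplyr makes %>% available):

```r
library(gapminder)
library(dplyr)   # makes %>% available

# x %>% f() is evaluated as f(x)
identical(gapminder %>% head(), head(gapminder))   # TRUE
identical(gapminder %>% nrow(), nrow(gapminder))   # TRUE

# A plain vector works the same way
c(4, 9, 16) %>% sqrt() %>% sum()   # same as sum(sqrt(c(4, 9, 16))), i.e. 9
```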

Never fear, you can still specify other arguments to this function! To see the first 3 rows of gapminder, we could say head(gapminder, 3) or this:

gapminder %>% head(3)

## # A tibble: 3 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.

You are probably not impressed yet, but the magic will happen soon.

We’ve barely scratched the surface of dplyr but I want to point out key things you may start to appreciate.

dplyr’s verbs, such as filter() and select(), are what are called pure functions. To quote from Wickham’s Advanced R book:

The functions that are the easiest to understand and reason about are pure functions: functions that always map the same input to the same output and have no other impact on the workspace. In other words, pure functions have no side effects: they don’t affect the state of the world in any way apart from the value they return.

And finally, the data is always the very first argument of every dplyr function.
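To illustrate the difference between a pure function and one with a side effect, here is a small base R sketch. The names add_one, bump, and counter are invented for the example and have nothing to do with dplyr itself.

```r
# A pure function: the output depends only on the input
add_one <- function(x) x + 1

# An impure function: it changes a variable outside itself (a side effect)
counter <- 0
bump <- function() {
  counter <<- counter + 1   # modifies the global `counter`
  counter
}

add_one(1)   # 2, every time
bump()       # 1
bump()       # 2: same call, different result
```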

5.3.4 Select Columns with `select()`

Use select() to subset the data on variables or columns.

Here’s a simple example:

select(gapminder, year, lifeExp)

## # A tibble: 1,704 × 2
##     year lifeExp
##    <int>   <dbl>
##  1  1952    28.8
##  2  1957    30.3
##  3  1962    32.0
##  4  1967    34.0
##  5  1972    36.1
##  6  1977    38.4
##  7  1982    39.9
##  8  1987    40.8
##  9  1992    41.7
## 10  1997    41.8
## # … with 1,694 more rows

And here’s the same operation, but written with the pipe operator and piped through head():

gapminder %>%
  select(year, lifeExp) %>%
  head(4)

## # A tibble: 4 × 2
##    year lifeExp
##   <int>   <dbl>
## 1  1952    28.8
## 2  1957    30.3
## 3  1962    32.0
## 4  1967    34.0

Think: “Take gapminder, then select the variables year and lifeExp, then show the first 4 rows.”

If we didn’t have the pipe operator, this is what the above code would look like:

head(select(gapminder, year, lifeExp), 4)

## # A tibble: 4 × 2
##    year lifeExp
##   <int>   <dbl>
## 1  1952    28.8
## 2  1957    30.3
## 3  1962    32.0
## 4  1967    34.0

As you can see, this is way harder to read. That’s why the pipe operator is so useful.

An important note is that select does not actually filter any rows. It simply selects columns.
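A quick check of that claim, as a sketch (assuming gapminder and dplyr are installed; picked is a name used only here):

```r
library(gapminder)
library(dplyr)

picked <- select(gapminder, year, lifeExp)

nrow(picked) == nrow(gapminder)   # TRUE: every row is kept
names(picked)                     # "year" "lifeExp"
```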

select() used alongside everything() is also quite handy if you want to move variables within your data frame. The everything() function selects all variables not explicitly mentioned in select(). For example, let’s move year and continent to the front of the gapminder tibble:

select(gapminder, year, continent, everything())

## # A tibble: 1,704 × 6
##     year continent country     lifeExp      pop gdpPercap
##    <int> <fct>     <fct>         <dbl>    <int>     <dbl>
##  1  1952 Asia      Afghanistan    28.8  8425333      779.
##  2  1957 Asia      Afghanistan    30.3  9240934      821.
##  3  1962 Asia      Afghanistan    32.0 10267083      853.
##  4  1967 Asia      Afghanistan    34.0 11537966      836.
##  5  1972 Asia      Afghanistan    36.1 13079460      740.
##  6  1977 Asia      Afghanistan    38.4 14880372      786.
##  7  1982 Asia      Afghanistan    39.9 12881816      978.
##  8  1987 Asia      Afghanistan    40.8 13867957      852.
##  9  1992 Asia      Afghanistan    41.7 16317921      649.
## 10  1997 Asia      Afghanistan    41.8 22227415      635.
## # … with 1,694 more rows

Here’s the data for Cambodia, but only certain variables…

gapminder %>%
  filter(country == "Cambodia") %>%
  select(year, lifeExp)

## # A tibble: 12 × 2
##     year lifeExp
##    <int>   <dbl>
##  1  1952    39.4
##  2  1957    41.4
##  3  1962    43.4
##  4  1967    45.4
##  5  1972    40.3
##  6  1977    31.2
##  7  1982    51.0
##  8  1987    53.9
##  9  1992    55.8
## 10  1997    56.5
## 11  2002    56.8
## 12  2007    59.7

… and what a typical base R call would look like:

gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]

## # A tibble: 12 × 2
##     year lifeExp
##    <int>   <dbl>
##  1  1952    39.4
##  2  1957    41.4
##  3  1962    43.4
##  4  1967    45.4
##  5  1972    40.3
##  6  1977    31.2
##  7  1982    51.0
##  8  1987    53.9
##  9  1992    55.8
## 10  1997    56.5
## 11  2002    56.8
## 12  2007    59.7

Resources for learning more about dplyr:

Package home on CRAN. Note there are several vignettes, with the introduction being the most relevant right now.
Development home on GitHub.
RStudio Data Wrangling cheatsheet, covering dplyr and tidyr. Remember you can get to these via Help > Cheatsheets.
Excellent slides on pipelines and dplyr by TJ Mahr, talk given to the Madison R Users Group.
Blog post Hands-on dplyr tutorial for faster data manipulation in R by Data School, which includes a link to an R Markdown document and links to videos.
Cheatsheet from RStudio for dplyr.