5 Data exploration
Whenever you have “spreadsheetey” data, your default data structure in R should be the data frame. Data frames are awesome because
- They neatly package related variables by maintaining a spreadsheet-like row-ordering. Data frames make it easy to filter rows and columns of interest.
- Most functions for inference, modelling, and graphing will happily take a data frame object.
- The set of packages known as the tidyverse takes data frames one step further and explicitly prioritizes the processing of data frames.
Recall that data frames, unlike vectors or matrices in R, can hold different variable types. For example, data frames can simultaneously hold character data (e.g., subject ID or name), quantitative data (e.g., white blood cell count), and categorical information (e.g., treated vs. untreated).
If you use data structures that only hold one type of data for data analysis, you might make the terrible mistake of spreading your data over multiple, unlinked objects. Why is this a mistake? Because you then have to keep the row order of each object in sync with every other object by hand, which is a nightmare.
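To make the danger concrete, here is a minimal sketch with made-up values. The object names (`subject_id`, `wbc_count`, `treatment`, `study`) are hypothetical and are not part of gapminder. Sorting one stand-alone vector silently breaks its link to the others, whereas reordering a data frame moves whole rows together.

```r
# Three unlinked vectors: the only thing relating them is their shared order.
subject_id <- c("s01", "s02", "s03")
wbc_count  <- c(7.2, 4.9, 11.3)
treatment  <- factor(c("treated", "untreated", "treated"))

# Sorting one vector on its own destroys the correspondence with the others:
# 4.9 now sits in position 1, next to "s01", although it belongs to "s02".
wbc_sorted <- sort(wbc_count)

# A data frame keeps the three variables, of three different types, together.
study <- data.frame(subject_id, wbc_count, treatment)

# Reordering the data frame by one column reorders every column at once.
study_sorted <- study[order(study$wbc_count), ]
as.character(study_sorted$subject_id)   # "s02" "s01" "s03"
```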
5.1 Get the gapminder data
We will work with some of the data from the Gapminder project. The Gapminder project contains the gapminder dataset, which summarises the progression of countries over time for statistics like life expectancy and GDP.
If you haven’t installed gapminder or the tidyverse yet, you can do so like this:
install.packages("gapminder", dependencies = TRUE)
install.packages("tidyverse", dependencies = TRUE)
Now load the two packages.
library(gapminder)
library(tidyverse)
5.2 Explore gapminder
By loading the gapminder package, we now have access to a data frame by the same name.
class(gapminder)
## [1] "tbl_df" "tbl" "data.frame"
Notice that the class (type of data structure) of the gapminder object is a tibble, the tidyverse’s version of R’s data frame. A tibble is also a data frame.
Let’s check out the contents of gapminder:
gapminder
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # … with 1,694 more rows
Although this seems like a lot of output, notice that tibbles provide a nice print method that shows the most important stuff and doesn’t fill up your console. Let’s make sense of the output:
The first line refers to what we’re printing—a tibble with 1704 rows and 6 columns.
Below each column heading, we see <fct> <fct> <int> <dbl> <int> <dbl>. These refer to the variable type of that column. fct is short for “factor” (kind of like a categorical variable), int is short for “integer”, and dbl is short for “double” (a number with decimal places).
If you’re only interested in a summary of your data frame, use str(), head() or tail():
- str() will provide a sensible description of almost anything and, worst case, nothing bad can actually happen. When in doubt, just use str() on your recently created objects to get ideas about what to do next.
- head() displays the first 6 rows of your data frame by default, and tail() shows the last 6 rows.
Play around with these functions in your own R console!
Just for your reference, if you want to change a data frame into a tibble for nicer printing, use as_tibble()!
as_tibble(my_data_frame) # my_data_frame is the thing we want to make a tibble
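As a small sketch of that conversion, the block below builds a plain data frame from made-up values and converts it. `my_data_frame` and its columns are hypothetical placeholders, and the tibble package is loaded directly here (it is also loaded by the tidyverse).

```r
library(tibble)

# A plain base R data frame built from made-up values.
my_data_frame <- data.frame(id = c("a", "b", "c"), score = c(1.5, 2.5, 3.5))
class(my_data_frame)        # "data.frame"

# Convert it to a tibble; the contents stay the same, only the class changes.
my_tibble <- as_tibble(my_data_frame)
class(my_tibble)            # "tbl_df" "tbl" "data.frame"

# A tibble is still a data frame, so data frame functions keep working.
is.data.frame(my_tibble)    # TRUE
```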
Here are more ways to query basic info on a data frame:
Function | Description |
---|---|
names() | returns column names |
ncol() | returns number of columns |
nrow() | returns number of rows |
dim() | returns # of rows by # of columns |
summary() | returns a statistical summary of each column |
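Applied to gapminder, the first four functions in the table give the results below. This is only a quick sketch, and it assumes the gapminder package is installed as described in section 5.1.

```r
library(gapminder)

names(gapminder)   # "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
ncol(gapminder)    # 6
nrow(gapminder)    # 1704
dim(gapminder)     # 1704 6 (rows first, then columns)
```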
Try playing around with these functions in your console.
5.2.1 Importing and exporting data
We can export data frames to a comma-separated values (.csv) file.
write.csv(gapminder, file = "data/03_data-frames/gapminder.csv")
Comma-separated values files are the preferred way of importing and exporting data because they contain no formatting. Other common formats include tab-separated values (.tsv) and Excel files (.xls or .xlsx).
In addition to writing to a .csv file, we can also read .csv files into R. It’s as simple as read.csv()!
gapminder2 <- read.csv("data/03_data-frames/gapminder.csv", header = TRUE)
class(gapminder2)
## [1] "data.frame"
As you can see, read.csv() returns a data frame object by default. Notice that we specify header = TRUE because the first row of our .csv file is a header. Also notice that we specified a file path to our .csv file.
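If you want to try the export and import round trip without touching your project folders, the sketch below writes a tiny made-up data frame to a temporary file and reads it back. The names `toy`, `toy2` and `csv_path` are hypothetical. The `row.names = FALSE` argument is an addition that the text above does not use: without it, write.csv() also writes the row names as an extra first column.

```r
# A small made-up data frame with one text column and one numeric column.
toy <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))

# tempfile() gives a throwaway path, so nothing in your project is overwritten.
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv() from adding a column of row names.
write.csv(toy, file = csv_path, row.names = FALSE)

# Read the file back; the first line of the file is used as the header.
toy2 <- read.csv(csv_path, header = TRUE)
class(toy2)    # "data.frame"
names(toy2)    # "name" "value"
dim(toy2)      # 2 2
```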
5.2.2 Exploring variables in a data frame
To specify a single variable from a data frame, use the dollar sign $. For example, gapminder$lifeExp refers to gapminder’s lifeExp column, which you can pass to a function just like any other vector.
Let’s continue to explore gapminder. Take a look at the year variable’s class:
class(gapminder$year)
## [1] "integer"
Notice that year holds integers. On the other hand, continent holds categorical information, which is called a factor in R.
class(gapminder$continent)
## [1] "factor"
Now, I want to illustrate something important:
summary(gapminder$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1952 1966 1980 1980 1993 2007
summary(gapminder$continent)
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
Notice that the same function returned different outputs for different variable types—forgetting this observation can lead to confusion in the future, so make sure to check your data before analysis! Let’s check out a couple more useful functions and highlight important ideas in the meantime.
Within a given column/variable, table() returns the number of observations for each level, levels() returns the unique values, and nlevels() returns the number of unique values.
table(gapminder$continent)
##
## Africa Americas Asia Europe Oceania
## 624 300 396 360 24
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
nlevels(gapminder$continent)
## [1] 5
The levels of the factor continent are “Africa”, “Americas”, etc. This is what’s usually presented to your eyeballs by R. Behind the scenes, R assigns integer values (i.e., 1, 2, 3, …) to each level. Never ever ever forget this fact. Look at the result from str(gapminder$continent) if you are skeptical:
str(gapminder$continent)
## Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit. Note that factors are NOT ordinary integers: although R stores each level as an integer code behind the scenes, those codes are only labels for categories, so doing arithmetic on them is meaningless.
Tl;dr, factors are categorical variables whereas levels are unique values within a factor.
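The integer codes behind a factor can be inspected directly. The sketch below uses two small made-up factors (`continent` and `dose` are hypothetical names, not columns of gapminder) and also shows a common pitfall when numbers end up stored as a factor.

```r
# A made-up factor with three distinct values.
continent <- factor(c("Asia", "Africa", "Asia", "Europe"))

levels(continent)        # "Africa" "Asia" "Europe" (sorted by default)
nlevels(continent)       # 3
as.integer(continent)    # 2 1 2 3 (the position of each value in levels())

# Pitfall: numbers stored as a factor.
dose <- factor(c("10", "20", "10"))
as.integer(dose)                  # 1 2 1   (the codes, not the numbers!)
as.numeric(as.character(dose))    # 10 20 10 (convert via character instead)
```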
Data frame summary. Use data frames and the tidyverse! The tidyverse provides a special type of data frame called a “tibble” that has nice default printing behavior, among other benefits.
- When in doubt, str() something or print something.
- Understand what your variable types are.
- Use factors! (but with intention and care)
- Do basic statistical and visual sanity checking of each variable.
- Refer to variables by name (ex: gapminder$lifeExp) and NOT by column number. Your code will be more robust and readable.
5.3 Data frames with dplyr
dplyr is a package for data manipulation. It is built to be fast, highly expressive, and open-minded about how your data is stored. It is installed as part of the tidyverse meta-package and it is among the packages loaded via library(tidyverse).
Here’s a bit of fun trivia: dplyr stands for “data frame pliers”.
5.3.1 Subsetting data
If you feel the urge to store a little snippet of your data:
canada <- gapminder[241:252, ]
Stop and ask yourself, “Do I want to create a separate subset of my original data?”
If “YES,” use proper data wrangling techniques. Alternatively, only subset the data as a temporary measure while you develop your elegant code.
If “NO,” then don’t subset!
Copies and excerpts of your data clutter your workspace, invite mistakes, and sow general confusion. Avoid whenever possible. Reality can also lie somewhere in between. You will find the workflows presented below can help you accomplish your goals with minimal creation of temporary, intermediate objects.
Recall the rm() function, which removes unwanted variable(s).
x <- 'thing to not keep'
print(x)
rm(x)
# print(x) # gives an error because x is deleted
5.3.2 Filter rows with filter()
filter() takes logical expressions and returns the rows for which all are TRUE. Use this function when you want to subset observations based on row values.
The first argument is the name of the data frame. The subsequent arguments are the expressions that filter the data frame. For example, let’s filter all rows from gapminder where life expectancy is less than 29 years.
filter(gapminder, lifeExp < 29)
## # A tibble: 2 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Rwanda Africa 1992 23.6 7290203 737.
When you run this line of code, dplyr filters the data and returns a new data frame. dplyr functions never modify their inputs, so if you want to save the result, you need to use the assignment operator, <-. Let’s try this out! Here we filter based on country and year:
rwanda_gthan_1979 <- filter(gapminder, country == "Rwanda", year > 1979)
Compare with some base R code to accomplish the same things:
gapminder[gapminder$lifeExp < 29, ] # indexing is distracting
subset(gapminder, country == "Rwanda") # almost same as filter; quite nice actually
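Because filter() keeps only the rows for which all of its expressions are TRUE, separating conditions with commas behaves like joining them with the Boolean “and” operator, &. The sketch below checks this on the Rwanda example; the object names `with_commas` and `with_and` are hypothetical.

```r
library(dplyr)
library(gapminder)

# Two conditions passed as separate arguments...
with_commas <- filter(gapminder, country == "Rwanda", year > 1979)

# ...and the same two conditions combined explicitly with &.
with_and <- filter(gapminder, country == "Rwanda" & year > 1979)

nrow(with_commas)                          # 6 (1982, 1987, ..., 2007)
all(with_commas$year == with_and$year)     # TRUE

# The input is untouched: gapminder still has all of its rows.
nrow(gapminder)                            # 1704
```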
What if you want to filter rows based on multiple values in a variable? For example, what if we want to filter all rows with either Rwanda or Afghanistan as countries?
filter(gapminder, country == "Rwanda" | country == "Afghanistan")
Recall that the Boolean operator, |, means “or”.
What if we want to keep more than just 2 countries? One way would be to string Boolean operators together like so: country == "Canada" | country == "Rwanda" | country == "Afghanistan" | ...
This, however, is very wordy. A useful shortcut is to use x %in% y. This selects every row where x is one of the values in y:
filter(gapminder, country %in% c("Rwanda", "Afghanistan"))
filter(gapminder, country %in% c("Canada", "Rwanda", "Afghanistan"))
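A short sketch to confirm that the two spellings agree, plus one pitfall worth knowing. The names `wanted`, `with_or`, `with_in` and `recycled` are hypothetical. The last call is not from the text above: it shows what happens if you write == where you meant %in%.

```r
library(dplyr)
library(gapminder)

wanted <- c("Canada", "Rwanda", "Afghanistan")

# The wordy version, with one comparison per country...
with_or <- filter(gapminder,
                  country == "Canada" | country == "Rwanda" | country == "Afghanistan")

# ...and the compact version using %in%.
with_in <- filter(gapminder, country %in% wanted)

nrow(with_or)   # 36: 12 rows for each of the 3 countries
nrow(with_in)   # 36

# Pitfall: == recycles the right-hand vector along the rows, so row 1 is
# compared to "Rwanda", row 2 to "Afghanistan", row 3 to "Rwanda", and so on.
recycled <- filter(gapminder, country == c("Rwanda", "Afghanistan"))
nrow(recycled)  # fewer than the 24 rows you wanted, with no warning
```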
Under no circumstances should you subset your data the way I did at first:
excerpt <- gapminder[241:252, ]
Why is this a terrible idea?
- It is not self-documenting. What is so special about rows 241 through 252?
- It is fragile. This line of code will produce different results if someone changes the row order of gapminder, e.g. sorts the data earlier in the script.
filter(gapminder, country == "Canada")
The above call explains itself and is fairly robust.
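The fragility is easy to demonstrate. In the sketch below, the data is sorted by year with dplyr’s arrange() (a verb not otherwise covered in this section) to stand in for “someone changed the row order”. The object names are hypothetical.

```r
library(dplyr)
library(gapminder)

# In the data as shipped, rows 241 to 252 happen to be Canada.
by_position <- gapminder[241:252, ]
unique(as.character(by_position$country))   # "Canada"

# Someone sorts the data by year earlier in the script...
resorted <- arrange(gapminder, year)

# ...and the same row numbers now pick up other countries.
broken <- resorted[241:252, ]
all(broken$country == "Canada")             # FALSE

# filter() still returns the 12 Canadian rows, whatever the row order.
robust <- filter(resorted, country == "Canada")
nrow(robust)                                # 12
```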
5.3.3 Pipe operator %>%
Before we go any further, we should exploit the pipe operator that the tidyverse imports from the magrittr package by Stefan Bache. Here’s what it looks like: %>%. The RStudio keyboard shortcut: Ctrl + Shift + M (Windows), Cmd + Shift + M (Mac).
Let’s demo then I’ll explain:
gapminder %>% head()
## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
The above code is equivalent to head(gapminder). The pipe operator takes the thing on the left-hand side and pipes it into the function call on the right-hand side. It literally drops it in as the first argument. You can think of an argument as your input to a function. If you remember your grade school math, functions in R do exactly what you learned in school: they take inputs (arguments/parameters) and spit out an output, or a return value.
Never fear, you can still specify other arguments to this function! To see the first 3 rows of gapminder, we could say head(gapminder, 3) or this:
gapminder %>% head(3)
## # A tibble: 3 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
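If you would like to convince yourself that the piped and the ordinary call really are the same thing, a quick check along these lines should do; `piped` and `nested` are hypothetical names.

```r
library(dplyr)
library(gapminder)

# The left-hand side becomes the first argument of head()...
piped <- gapminder %>% head(3)

# ...so this is the same call written the ordinary way.
nested <- head(gapminder, 3)

identical(piped, nested)   # TRUE
nrow(piped)                # 3
```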
You are probably not impressed yet, but the magic will happen soon.
We’ve barely scratched the surface of dplyr but I want to point out key things you may start to appreciate.
dplyr’s verbs, such as filter() and select(), are what’s called pure functions. To quote from Wickham’s Advanced R Programming book:
The functions that are the easiest to understand and reason about are pure functions: functions that always map the same input to the same output and have no other impact on the workspace. In other words, pure functions have no side effects: they don’t affect the state of the world in any way apart from the value they return.
And finally, the data is always the very first argument of every dplyr function.
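A small sketch of what “pure” means in practice: calling filter() leaves its input alone, and calling it twice with the same input gives the same result. The names `rows_before`, `first_call` and `second_call` are hypothetical.

```r
library(dplyr)
library(gapminder)

rows_before <- nrow(gapminder)

# Same input, same expression, called twice.
first_call  <- filter(gapminder, lifeExp < 29)
second_call <- filter(gapminder, lifeExp < 29)

nrow(first_call)                          # 2
isTRUE(all.equal(first_call, second_call)) # TRUE: same input, same output

# No side effects: gapminder itself has not lost any rows.
nrow(gapminder) == rows_before            # TRUE
```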
5.3.4 Select Columns with select()
Use select() to subset the data on variables or columns.
Here’s a simple example:
select(gapminder, year, lifeExp)
## # A tibble: 1,704 × 2
## year lifeExp
## <int> <dbl>
## 1 1952 28.8
## 2 1957 30.3
## 3 1962 32.0
## 4 1967 34.0
## 5 1972 36.1
## 6 1977 38.4
## 7 1982 39.9
## 8 1987 40.8
## 9 1992 41.7
## 10 1997 41.8
## # … with 1,694 more rows
And here’s the same operation, but written with the pipe operator and piped through head():
gapminder %>%
  select(year, lifeExp) %>%
  head(4)
## # A tibble: 4 × 2
## year lifeExp
## <int> <dbl>
## 1 1952 28.8
## 2 1957 30.3
## 3 1962 32.0
## 4 1967 34.0
Think: “Take gapminder, then select the variables year and lifeExp, then show the first 4 rows.”
If we didn’t have the pipe operator, this is what the above function would look like:
head(select(gapminder, year, lifeExp), 4)
## # A tibble: 4 × 2
## year lifeExp
## <int> <dbl>
## 1 1952 28.8
## 2 1957 30.3
## 3 1962 32.0
## 4 1967 34.0
As you can see, this is way harder to read. That’s why the pipe operator is so useful.
An important note is that select() does not actually filter any rows. It simply selects columns.
select() used alongside everything() is also quite handy if you want to move variables within your data frame. The everything() function selects all variables not explicitly mentioned in select(). For example, let’s move year and continent to the front of the gapminder tibble:
select(gapminder, year, continent, everything())
## # A tibble: 1,704 × 6
## year continent country lifeExp pop gdpPercap
## <int> <fct> <fct> <dbl> <int> <dbl>
## 1 1952 Asia Afghanistan 28.8 8425333 779.
## 2 1957 Asia Afghanistan 30.3 9240934 821.
## 3 1962 Asia Afghanistan 32.0 10267083 853.
## 4 1967 Asia Afghanistan 34.0 11537966 836.
## 5 1972 Asia Afghanistan 36.1 13079460 740.
## 6 1977 Asia Afghanistan 38.4 14880372 786.
## 7 1982 Asia Afghanistan 39.9 12881816 978.
## 8 1987 Asia Afghanistan 40.8 13867957 852.
## 9 1992 Asia Afghanistan 41.7 16317921 649.
## 10 1997 Asia Afghanistan 41.8 22227415 635.
## # … with 1,694 more rows
Here’s the data for Cambodia, but only certain variables…
gapminder %>%
  filter(country == "Cambodia") %>%
  select(year, lifeExp)
## # A tibble: 12 × 2
## year lifeExp
## <int> <dbl>
## 1 1952 39.4
## 2 1957 41.4
## 3 1962 43.4
## 4 1967 45.4
## 5 1972 40.3
## 6 1977 31.2
## 7 1982 51.0
## 8 1987 53.9
## 9 1992 55.8
## 10 1997 56.5
## 11 2002 56.8
## 12 2007 59.7
… and what a typical base R call would look like:
gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]
## # A tibble: 12 × 2
## year lifeExp
## <int> <dbl>
## 1 1952 39.4
## 2 1957 41.4
## 3 1962 43.4
## 4 1967 45.4
## 5 1972 40.3
## 6 1977 31.2
## 7 1982 51.0
## 8 1987 53.9
## 9 1992 55.8
## 10 1997 56.5
## 11 2002 56.8
## 12 2007 59.7
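The two printouts above look the same; a sketch like the following compares the values directly. The names `via_dplyr` and `via_base` are hypothetical.

```r
library(dplyr)
library(gapminder)

# The dplyr pipeline: filter rows, then select columns.
via_dplyr <- gapminder %>%
  filter(country == "Cambodia") %>%
  select(year, lifeExp)

# The base R call: row condition and column names inside one pair of brackets.
via_base <- gapminder[gapminder$country == "Cambodia", c("year", "lifeExp")]

dim(via_dplyr)                                    # 12 2
identical(via_dplyr$year, via_base$year)          # TRUE
identical(via_dplyr$lifeExp, via_base$lifeExp)    # TRUE
```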
- dplyr package home on CRAN. Note there are several vignettes, with the introduction being the most relevant right now.
- Development home on GitHub.
- RStudio Data Wrangling cheatsheet, covering dplyr and tidyr. Remember you can get to these via Help > Cheatsheets.
- Excellent slides on pipelines and dplyr by TJ Mahr, talk given to the Madison R Users Group.
- Blog post Hands-on dplyr tutorial for faster data manipulation in R by Data School, that includes a link to an R Markdown document and links to videos.
- Cheatsheet from RStudio for dplyr.