13 Data visualization basics

“The simple graph has brought more information to the data analyst’s mind than any other device.” – John Tukey

This lab is all about preparing publication-ready figures with ggplot2 and related packages. ggplot2 uses elegant syntax and it implements “The Layered Grammar of Graphics”.

Before we begin, let’s review a few key points for good figures:

  1. Be clear and avoid confusion. Presenting too much information often results in messy figures. Figures inconsistent in colour, symbols, etc. can easily confuse readers.

  2. Only use additional aesthetic effects when necessary. Everything in a graph has a purpose. The primary objective of a figure is to inform, not to look fancy (though this is a plus). When in doubt, stick to black, white, and grey.

  3. Only use texts when necessary. Make text large. Never use Comic Sans as your font. Sans serif fonts such as Arial and Calibri are usually good bets.

Key point: the plot depends on the variables. Some plots are more appropriate for visualization than others. You have an obligation to display data responsibly. Check out The R Graph Gallery for a “dictionary” on different visualizations and the code to create them.

Recall that the tidyverse contains ggplot2 and dplyr among other packages. We’ll also load gapminder.

library(gapminder)
library(tidyverse)

Hopefully by now you have a good idea of what the gapminder dataset looks like. Here’s a quick refresher:

head(gapminder)
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.

13.1 A graphing template

Although we’ve presented graphs with ggplot2 in previous labs, let’s delve into the specifics of the syntax:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Everything in the initial ggplot() function is passed into the subsequent functions (i.e., GEOM_FUNCTION()). When calling ggplot(), you don’t explicitly need to write <ARGUMENT> = (ex: data = gapminder, x = lifeExp) as long as you have the variables in the correct order – just be careful you don’t mix up x and y!

ggplot2 works in a layer-by-layer manner. Take a look at this:

ggplot(data = gapminder, aes(x = pop, y = gdpPercap))

ggplot(data = gapminder, aes(x = pop, y = gdpPercap)) +
  geom_point()

Here, the first line initializes the object and the second line adds a layer of scatter points. Unlike plotting in base R, you don’t need to specify variables using the $ operator – ggplot2 is smart enough to call it automatically for you.

I’d like to bring your attention to the + operator. This is how you add layers to the ggplot. Use + liberally to reduce excessively long lines; breaking long commands at appropriate places makes your code much more readable.

13.2 Scatter plot

Use the scatter plot when you want to see the relationship between two continuous variables.

ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point()

You also can save the object as a variable. To show the figure, use the show() function or simply call the object by its name.

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point()
show(p)  # method 1
p        # also works

This data might benefit using a log scale. You can either log-transform (log(gdpPercap)) or simply draw the x axis in log scale (scale_x_log10()).

p <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point() +
  scale_x_log10()
show(p)

13.2.1 Trend lines

It seems that there exists a positive correlation between the two variables – you might want to add a trend line. Remember, p is the object of scatter plot with x in log. We can just build from here. This just goes to show the beauty of layered graphical syntax.

p + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

You find that R used ‘gam’ model as default. ‘gam’ is the generalized additive model. Without going into the mathematical details, you would expect a curve from gam.

What if I want a straight line (i.e., linear regression)? I would want a straight line to start with. Let’s start with a linear model (lm).

p + geom_smooth(method = 'lm')
## `geom_smooth()` using formula 'y ~ x'

We can also create a generalized linear model (glm). glm can be useful if your variables are not normally distributed.

p + geom_smooth(method = 'glm')
## `geom_smooth()` using formula 'y ~ x'

It appears thatlm and glm don’t look too different.

Notice that lm has a little shaded region around it. This correspondends to the confidence interval of 1 standard error. To get rid of it, specify se = FALSE. While we’re at it, let’s change the colour of the line to red and decrease its thickness. Let’s also remove that pesky grey background:

scatter_trend <- p + theme_classic() +         # removes grey background
  geom_smooth(method = 'lm',
                se = F,       # remove confidence band
                col = 'red',  # change colour of line to red (hex colours also work: #FF0000)
                size = 0.75)  # set width of line
show(scatter_trend)
## `geom_smooth()` using formula 'y ~ x'

13.2.2 Facets

If you were paying careful attention, you may have noticed that we were using data from multiple years. However, this results in a messy, and potentially misleading, graph. Let’s fix this by plotting each year separately with the facet feature.

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  facet_wrap(vars(year), nrow = 3, ncol = 4) +
  geom_point(size = 0.5) +
  scale_x_log10() +
  theme_bw()  # another ggplot2 theme

We can use faceting to split by the combination of two variables. Here I will use facet_grid to put the same values of splitting variables on the same row/column. Note that you also need to specify vars(year) and vars(continent) instead of just year and continent. This is simply to help ggplot retrieve the levels in a particular column.

gap.52.77.07 <- gapminder %>% filter(year %in% c(1952, 1977, 2007))
ggplot(data = gap.52.77.07, aes(x = gdpPercap, y = lifeExp)) +
  facet_grid(rows = vars(year), cols = vars(continent)) +
  geom_point() +
  scale_x_log10() +
  theme_bw()

13.3 Aesthetic mappings

A scatter plot places dots using x and y coordinates. What if we want to show more detail, like which point(s) correspond to a particular group? For example, what if we want to see how life expectancy vs GDP per capita varies per continent in 1977?

gap.77 <- gapminder %>% filter(year == 1977)
ggplot(data = gap.77, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color=continent)) +
  scale_x_log10()

You can see the points from some continents, like Europe and Africa, cluster at distinct positions.

In addition to adding colours to show (categorical or continuous) groupings, we can also use

  1. Shape of points (categorical),
  2. Size of points (continuous), or
  3. Transparency/alpha of the points (continuous).

These options are all specified within aesthetic mappings (aes()). That is, aes() is the place you specify how you present your variables. More specifically, it’s how you map your variables to various aesthetics. To repeat an earlier point, aes() within the ggplot() function applies to ALL layers, while those in other layers only applies to that specific layer.

Here’s another example. Be careful with this as R interprets year as a continuous variable. Use as.factor() or factor() to circumvent this issue,

ggplot(data = gap.52.77.07, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(col = as.factor(year))) +
  scale_x_log10()

Going back to our original example of lifeExp vs gdpPercap by continent, I use stat_ellipse() to enclose the points within a 95% confidence interval.

ggplot(data = subset(gapminder, year == 1977), 
       aes(x = gdpPercap, y = lifeExp)) +
  # color only applies to the points, not the eclipses
  geom_point(aes(color=continent)) +
  # stat_ellipse uses level=0.95 by default
  stat_ellipse() +
  scale_x_log10()

We can also create multiple ellipses to group things together. Note that color = continent is the same as colour = continent and col = continent.

ggplot(data = subset(gapminder, year == 1977), 
       aes(x = gdpPercap, y = lifeExp, color = continent)) + 
  # color applies to both points and eclipses
  geom_point() +
  stat_ellipse() +
  scale_x_log10()
## Too few points to calculate an ellipse
## Warning: Removed 1 row(s) containing missing values (geom_path).

What if we want to also visualize population size in addition to grouping by continent?

ggplot(data = subset(gapminder, year == 1977), aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color=continent, size=pop)) +
  scale_x_log10()

Here we run into a minor problem: some dots are overlapping. We can fix this by applying geom_jitter() with partial transparency.

ggplot(data = subset(gapminder, year == 1977), aes(x = gdpPercap, y = lifeExp)) +
  geom_jitter(alpha=0.7, aes(color=continent, size=pop)) +
  scale_x_log10()

Notice that geom_jitter() adds some random variation, or jitter, to each point. While this is a handy method to address overplotting, don’t rely on it too heavily. This is illustrated in the next plot:

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_jitter(aes(color=continent, size=pop, alpha=year)) +
  scale_x_log10()

As you can see, this is a fancy figure, but also a messy figure. This plot has “information overload,” so we would like to simplify it. To reduce the amount of information, we come back to facets. Be careful with this one as R interprets year as a continuous variable. Use as.factor() or factor() to circumvent this issue,

ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_jitter(alpha=0.65, aes(size=pop, color=as.factor(year))) +
  facet_wrap(vars(continent)) +
  scale_x_log10()

Finally, we can change the labels and text formats. Here, we can save the figure to a .png format. Alternatively, you can save the figure using the Export tab in the Plots viewing panel in RStudio.

p <- ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_jitter(alpha=0.65, aes(size=pop, color=as.factor(year))) +
  facet_wrap(vars(continent)) +
  scale_x_log10() +
  labs(x = "GDP per capita", y = "Life Expectancy", size = "Population", color = "Year") +
  theme_bw() +                           # remove the gray background
  theme(text = element_text(size = 16))  # make texts larger
show(p)

ggsave("data/09_ggplot2/Life_GDP.png", plot = p)
## Saving 7 x 5 in image