3 Data structures

In this chapter, we will learn about data structures that will greatly aid our data science workflow.

3.1 Vectors

Vectors. A vector is a sequence of values of the same type. We can create vectors using c(), which stands for “combine”. Wrapping an assignment in parentheses, as below, also prints the result.

(my_nums <- c(2.8, 3.2, 1.5, 3.8))

## [1] 2.8 3.2 1.5 3.8

To access the elements inside a vector, we use the square bracket operator [], an operation often called “slicing” or “subsetting”. The same operator selects a single item or multiple items, and positions are counted starting from 1. In general, [] in R means “give me a piece of something”. For example:

my_nums[4]

## [1] 3.8

my_nums[1:3]

## [1] 2.8 3.2 1.5

my_nums[c(1, 2, 3)] == my_nums[1:3]

## [1] TRUE TRUE TRUE

In my_nums[1:3], the 1:3 creates a vector from 1 to 3, which is then used to subset multiple items in a vector. Here are some additional useful functions:

length(my_nums)
mean(my_nums)
max(my_nums)
min(my_nums)
sum(my_nums)
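
Square brackets also accept a logical vector, which is handy when you want to select items by a condition rather than by position. The following is a minimal sketch using the my_nums vector from above; the variable name is_large is only an illustrative choice.

```r
my_nums <- c(2.8, 3.2, 1.5, 3.8)

# A comparison on a vector returns one TRUE/FALSE per element
is_large <- my_nums > 3
is_large
## [1] FALSE  TRUE FALSE  TRUE

# A logical vector inside [] keeps only the TRUE positions
my_nums[is_large]
## [1] 3.2 3.8

# != means "not equal to"
my_nums[my_nums != 1.5]
## [1] 2.8 3.2 3.8

# A negative index drops the item at that position
my_nums[-1]
## [1] 3.2 1.5 3.8
```

Because the comparison is evaluated element by element, the same logical vector can be used to subset a second vector of the same length.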

Given a names vector and a matching ages vector (the data in the interactive block), consider the following exercises:

Select “Pouria” and “Ana” from the names vector.
Select all individuals who have ages greater than 20. Assume names and ages are stored in the same order, so that the same index refers to the same person in both vectors.
Select all individuals whose age is not 21.
Find the average age of all individuals.

3.1.1 Missing values

So far we’ve worked with data with no missing values. In real life, however, we often have missing values, which R represents as NA. Unfortunately for us, a single NA is enough to change the result of a calculation:

density_ha <- c(2.8, 3.2, 1.5, NA)
mean(density_ha)

## [1] NA

Why did we get NA? Well, it’s hard to say what a calculation including NA should be, so most calculations return NA when NA is in the data. One way to resolve this issue is to tell our function to remove the NA before executing:

mean(density_ha, na.rm = TRUE)

## [1] 2.5
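
Before removing missing values, it often helps to find out where they are and how many there are. The sketch below uses base R's is.na() on the same density_ha vector; it is one possible approach, not the only one.

```r
density_ha <- c(2.8, 3.2, 1.5, NA)

# is.na() flags the missing positions
is.na(density_ha)
## [1] FALSE FALSE FALSE  TRUE

# Count the missing values (each TRUE counts as 1)
sum(is.na(density_ha))
## [1] 1

# Keep only the non-missing values
density_ha[!is.na(density_ha)]
## [1] 2.8 3.2 1.5

# na.rm works in other summary functions too
max(density_ha, na.rm = TRUE)
## [1] 3.2
```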

3.2 Lists

Lists. A list is a vector-like structure that can store elements of different types (e.g., numbers, strings, vectors). We can create lists using the list() function.

sites <- c("a", "b", "c")
notes <- "It was a good day in the field today. Warm, sunny, lots of gators."
helpers <- 4
field_notes <- list(sites, notes, helpers)

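You can index lists in the following ways. Single brackets [] return a list containing the selected elements, while double brackets [[]] return the element itself: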

field_notes[1]

## [[1]]
## [1] "a" "b" "c"

field_notes[[1]]

## [1] "a" "b" "c"

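We can also give the values names and access them using the $ symbol, which is the preferred method, or via subsetting with the name in quotes: [["variable_name"]] returns the element itself, while ["variable_name"] returns a list containing it. Try getting the sites vector from field_notes.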
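
As a rough illustration of named access, the sketch below rebuilds field_notes with names. The element names my_sites, my_notes and my_helpers are assumptions chosen for this example; any valid names would work.

```r
sites <- c("a", "b", "c")
notes <- "It was a good day in the field today. Warm, sunny, lots of gators."
helpers <- 4

# my_sites, my_notes and my_helpers are illustrative names
field_notes <- list(my_sites = sites, my_notes = notes, my_helpers = helpers)

field_notes$my_sites        # the element itself
## [1] "a" "b" "c"

field_notes[["my_sites"]]   # same result as $
## [1] "a" "b" "c"

field_notes["my_sites"]     # a list of length one, name kept
## $my_sites
## [1] "a" "b" "c"
```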

3.3 Data frames

This is where things get really exciting! We will use these data frames extensively in the upcoming chapters, so it’s important to pay attention here.

Data frames. A data frame is a table that groups equal-length vectors together, one vector per column. You can think of a data frame as a table in a spreadsheet. You can create data frames using the data.frame() function.

A data frame can contain both categorical and numerical columns, whereas a vector can only contain values of the same type (i.e., all numerical, all categorical, etc.).

sites <- c("a", "a", "b", "c")
area_ha <- c(1, 2, 3, 4)
density_ha <- c(2.8, 3.2, 1.5, NA)
# creating the data frame
surveys <- data.frame(sites, density_ha, area_ha)
surveys

##   sites density_ha area_ha
## 1     a        2.8       1
## 2     a        3.2       2
## 3     b        1.5       3
## 4     c         NA       4

Here are some useful commands to investigate a data frame:

str() returns the structure of a data frame.
length() returns the number of columns of a data frame, because a data frame is stored as a list of columns.
ncol() returns the number of columns of a data frame (same as length()).
nrow() returns the number of rows of a data frame.

str(surveys)

## 'data.frame':    4 obs. of  3 variables:
##  $ sites     : chr  "a" "a" "b" "c"
##  $ density_ha: num  2.8 3.2 1.5 NA
##  $ area_ha   : num  1 2 3 4

ncol(surveys)

## [1] 3

nrow(surveys)

## [1] 4
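
A few other base R functions are commonly used for a quick look at a data frame. The sketch below applies dim(), names() and head() to the surveys data frame defined above; these are suggestions beyond the commands listed in this section.

```r
sites <- c("a", "a", "b", "c")
area_ha <- c(1, 2, 3, 4)
density_ha <- c(2.8, 3.2, 1.5, NA)
surveys <- data.frame(sites, density_ha, area_ha)

dim(surveys)       # number of rows, then number of columns
## [1] 4 3

names(surveys)     # column names
## [1] "sites"      "density_ha" "area_ha"

length(surveys) == ncol(surveys)
## [1] TRUE

head(surveys, 2)   # first two rows
##   sites density_ha area_ha
## 1     a        2.8       1
## 2     a        3.2       2
```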

Subsetting data frames is very similar to subsetting vectors. This time, however, we need to consider both rows and columns. We can access a specific member like this: my_data_frame[row, column]. Leaving the row or the column blank selects all of them, so surveys[1, ] returns the whole first row and surveys[, 2] returns the whole second column.
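
Here is a small sketch of the [row, column] pattern, using the surveys data frame from earlier in this section. The particular rows and columns selected are arbitrary choices for illustration.

```r
sites <- c("a", "a", "b", "c")
area_ha <- c(1, 2, 3, 4)
density_ha <- c(2.8, 3.2, 1.5, NA)
surveys <- data.frame(sites, density_ha, area_ha)

surveys[1, 2]                    # row 1, column 2
## [1] 2.8

surveys[2, ]                     # all of row 2 (still a data frame)
##   sites density_ha area_ha
## 2     a        3.2       2

surveys[, "area_ha"]             # one column, returned as a vector
## [1] 1 2 3 4

surveys$density_ha               # a column by name with $
## [1] 2.8 3.2 1.5  NA

surveys[surveys$area_ha > 2, ]   # rows where area_ha is greater than 2
##   sites density_ha area_ha
## 3     b        1.5       3
## 4     c         NA       4
```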

3.3.1 External data

We can read in external data using the read.csv() function. The main argument is the location of the data, which is either a URL or a path on your computer.

shrub_data <- read.csv('https://datacarpentry.org/semester-biology/data/shrub-dimensions-labeled.csv')
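
Reading from a path on your computer works the same way. The sketch below writes a tiny CSV to a temporary file and reads it back, so it runs without a network connection. The names csv_path and local_shrubs are hypothetical, and the two rows are copied from the first two records of the shrub data shown in the next section.

```r
# Write a small CSV to a temporary file (csv_path is an illustrative name)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("shrubID,length,width,height",
             "a1,2.2,1.3,9.6",
             "a2,2.1,2.2,7.6"),
           csv_path)

# Read it back by passing the file path instead of a URL
local_shrubs <- read.csv(csv_path)
local_shrubs
##   shrubID length width height
## 1      a1    2.2   1.3    9.6
## 2      a2    2.1   2.2    7.6

# Remove the temporary file
unlink(csv_path)
```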

3.3.2 Factors

Let’s use the str() function to get more information about our variable shrub_data.

str(shrub_data)

## 'data.frame':    10 obs. of  4 variables:
##  $ shrubID: chr  "a1" "a2" "b1" "b2" ...
##  $ length : num  2.2 2.1 2.7 3 3.1 2.5 1.9 1.1 3.5 2.9
##  $ width  : num  1.3 2.2 1.5 4.5 3.1 2.8 1.8 0.5 2 2.7
##  $ height : num  9.6 7.6 2.2 1.5 4 3 4.5 2.3 7.5 3.2

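Notice that the shrubID column has type chr (character) in the output above. In versions of R before 4.0.0, read.csv() converted text columns to type Factor by default, so you may see Factor there if you are using an older version. A factor is a special data type (NOT data structure) in R for categorical data. Factors are useful for statistics, but can mess up some aspects of computation as we’ll see in future chapters. To keep text columns as character vectors regardless of your R version, set stringsAsFactors = FALSE: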

shrub_data <- read.csv('https://datacarpentry.org/semester-biology/data/shrub-dimensions-labeled.csv', stringsAsFactors = FALSE)
str(shrub_data)

## 'data.frame':    10 obs. of  4 variables:
##  $ shrubID: chr  "a1" "a2" "b1" "b2" ...
##  $ length : num  2.2 2.1 2.7 3 3.1 2.5 1.9 1.1 3.5 2.9
##  $ width  : num  1.3 2.2 1.5 4.5 3.1 2.8 1.8 0.5 2 2.7
##  $ height : num  9.6 7.6 2.2 1.5 4 3 4.5 2.3 7.5 3.2
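
To give a feel for how factors can interfere with computation, here is a minimal sketch of one common pitfall: numbers that were stored as a factor. The sizes vector is made up for this example and is not part of the shrub data.

```r
# Hypothetical values that were read in as categories instead of numbers
sizes <- factor(c("10", "2", "10"))

levels(sizes)                     # the distinct categories, sorted as text
## [1] "10" "2"

as.numeric(sizes)                 # the internal level codes, not the numbers
## [1] 1 2 1

as.numeric(as.character(sizes))   # convert to text first to recover the numbers
## [1] 10  2 10
```

Each value in a factor is stored as an integer code that points to one of its levels, which is why converting directly to numeric returns the codes.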