3 Data structures
In this chapter, we will learn about data structures that will greatly aid our data science workflow.
3.1 Vectors
Vectors. A vector is a sequence of values of the same type. We can create vectors using c(), which stands for “combine”.
(my_nums <- c(2.8, 3.2, 1.5, 3.8))
## [1] 2.8 3.2 1.5 3.8
To access the elements of a vector, we can use something called “slicing”. To access a single item or multiple items, use the square bracket operator []. In general, [] in R means “give me a piece of something”. For example:
my_nums[4]
## [1] 3.8
my_nums[1:3]
## [1] 2.8 3.2 1.5
my_nums[c(1, 2, 3)] == my_nums[1:3]
## [1] TRUE TRUE TRUE
In my_nums[1:3], the 1:3 creates a vector from 1 to 3, which is then used to select multiple items from my_nums. Here are some additional useful functions:
length(my_nums)
mean(my_nums)
max(my_nums)
min(my_nums)
sum(my_nums)
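As a quick check, here is what each of these functions returns for the my_nums vector defined earlier. The vector is redefined so that the snippet runs on its own; the values in the comments follow from simple arithmetic on its four elements.

```r
# Same values as the my_nums vector created at the start of this section
my_nums <- c(2.8, 3.2, 1.5, 3.8)

length(my_nums)  # number of elements: 4
mean(my_nums)    # arithmetic mean: 2.825
max(my_nums)     # largest value: 3.8
min(my_nums)     # smallest value: 1.5
sum(my_nums)     # total of all elements: 11.3
```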
Given the data in the interactive block, consider the following exercises:
- Select “Pouria” and “Ana” from the names vector.
- Select all individuals who have ages greater than 20. Assume the names and ages vectors are in the same order, so that the name at a given index belongs with the age at the same index.
- Select all individuals whose age is not 21.
- Find the average age of all individuals.
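These exercises rely on subsetting with a condition, which the examples so far have not shown. The sketch below illustrates the idea on two hypothetical vectors, plants and heights_cm, which are made up for illustration and are not the exercise data.

```r
# Hypothetical parallel vectors: plants[i] goes with heights_cm[i]
plants <- c("fern", "oak", "moss", "pine")
heights_cm <- c(40, 900, 3, 1200)

# A comparison returns one TRUE/FALSE value per element
is_tall <- heights_cm > 100

# A logical vector inside [] keeps only the positions that are TRUE
tall_plants <- plants[is_tall]            # "oak" "pine"

# != means "not equal to"
not_tiny <- plants[heights_cm != 3]       # "fern" "oak" "pine"

# %in% tests membership, which is handy for selecting by value
picked <- plants[plants %in% c("fern", "moss")]  # "fern" "moss"
```

Because the comparison is made on one vector and the result is used to index another, this only works when both vectors have the same length and order.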
3.1.1 Missing values
So far we’ve worked with data with no missing values. In real life, however, we often have missing values (NA values). Unfortunately for us, most calculations in R do not cope with NA values by default.
density_ha <- c(2.8, 3.2, 1.5, NA)
mean(density_ha)
## [1] NA
Why did we get NA? Well, it’s hard to say what the result of a calculation that includes NA should be, so most calculations return NA when NA is in the data. One way to resolve this issue is to tell our function to remove the NA values before executing:
mean(density_ha, na.rm = TRUE)
## [1] 2.5
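Before removing missing values, it often helps to find out where they are and how many there are. A minimal sketch using is.na() on the same density_ha vector; the variable name observed is introduced here only for illustration.

```r
density_ha <- c(2.8, 3.2, 1.5, NA)

# is.na() flags each missing element
is.na(density_ha)        # FALSE FALSE FALSE  TRUE

# TRUE counts as 1, so summing the flags counts the missing values
sum(is.na(density_ha))   # 1

# Keep only the values that are not missing, then compute on those
observed <- density_ha[!is.na(density_ha)]
mean(observed)           # 2.5, the same result as na.rm = TRUE
```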
3.2 Lists
Lists. Lists are a vector-like structure that can store elements of different types (e.g., numbers, strings, vectors). We can create lists using the list() function.
sites <- c("a", "b", "c")
notes <- "It was a good day in the field today. Warm, sunny, lots of gators."
helpers <- 4
field_notes <- list(sites, notes, helpers)
You can index lists in the following ways:
field_notes[1]
## [[1]]
## [1] "a" "b" "c"
field_notes[[1]]
## [1] "a" "b" "c"
We can also give the values names and access them using the $ symbol (the preferred method) or via ["variable_name"] with subsetting. Try getting the my_sets vector from field_notes.
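A minimal sketch of a named list, assuming we reuse the sites, notes, and helpers values from above and give each element the same name as its variable. The element names are a choice made for this illustration.

```r
sites <- c("a", "b", "c")
notes <- "It was a good day in the field today. Warm, sunny, lots of gators."
helpers <- 4

# name = value pairs give each list element a name
field_notes <- list(sites = sites, notes = notes, helpers = helpers)

names(field_notes)          # "sites" "notes" "helpers"

# $ and [[ ]] both return the element itself
field_notes$sites           # "a" "b" "c"
field_notes[["helpers"]]    # 4

# Single brackets return a smaller list that still wraps the element
field_notes["helpers"]
```

The difference between the last two forms mirrors field_notes[1] versus field_notes[[1]]: single brackets give back a list, double brackets give back what is inside it.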
3.3 Data frames
This is where things get really exciting! We will use these data frames extensively in the upcoming chapters, so it’s important to pay attention here.
Data frames. A data frame is a table that groups equal-length vectors together. You can think of a data frame as a table in a spreadsheet. You can create data frames using the data.frame() function.
A data frame can contain both categorical and numerical columns, whereas a vector can only contain values of the same type (i.e., all numerical, all categorical, etc.).
sites <- c("a", "a", "b", "c")
area_ha <- c(1, 2, 3, 4)
density_ha <- c(2.8, 3.2, 1.5, NA)
# creating the data frame
surveys <- data.frame(sites, density_ha, area_ha)
surveys
##   sites density_ha area_ha
## 1     a        2.8       1
## 2     a        3.2       2
## 3     b        1.5       3
## 4     c         NA       4
Here are some useful commands to investigate a data frame:
- str() returns the structure of a data frame.
- length() returns the length of a data frame, which is its number of columns.
- ncol() returns the number of columns of a data frame (same as length()).
- nrow() returns the number of rows of a data frame.
str(surveys)
## 'data.frame': 4 obs. of 3 variables:
## $ sites : chr "a" "a" "b" "c"
## $ density_ha: num 2.8 3.2 1.5 NA
## $ area_ha : num 1 2 3 4
ncol(surveys)
## [1] 3
nrow(surveys)
## [1] 4
Subsetting data frames is very similar to subsetting vectors. This time, however, we need to consider both rows and columns. We can access a specific element like this: my_data_frame[row, column]. Try playing around with the code below :)
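As a starting point, here is a short sketch of common ways to subset the surveys data frame. The data frame is rebuilt so that the snippet runs on its own; the name large_sites is introduced only for this example.

```r
surveys <- data.frame(
  sites = c("a", "a", "b", "c"),
  density_ha = c(2.8, 3.2, 1.5, NA),
  area_ha = c(1, 2, 3, 4)
)

surveys[1, 2]          # row 1, column 2: 2.8
surveys[2, ]           # all columns of row 2 (still a data frame)
surveys[, "sites"]     # one column by name, returned as a vector
surveys$area_ha        # the same idea using $: 1 2 3 4

# A logical condition in the row position keeps the matching rows
large_sites <- surveys[surveys$area_ha > 2, ]   # rows 3 and 4
```

Leaving the row or column position empty means “all of them”, which is why the comma is still needed in surveys[2, ].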
3.3.1 External data
We can read in external data using the read.csv() function. The main argument is the location of the data, which is either a URL or a path on your computer.
shrub_data <- read.csv('https://datacarpentry.org/semester-biology/data/shrub-dimensions-labeled.csv')
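Reading from a path on your computer works the same way as reading from a URL. The sketch below is self-contained: it writes a tiny made-up table to a temporary file and reads it back. The names example, tmp_path, and local_data are hypothetical, and in practice the path would point to a file you already have.

```r
# A small made-up table, used only to have something to save
example <- data.frame(shrubID = c("a1", "a2"), length = c(2.2, 2.1))

# tempfile() gives a path that is safe to write to on any machine
tmp_path <- tempfile(fileext = ".csv")
write.csv(example, tmp_path, row.names = FALSE)

# read.csv() takes the path as its first argument
local_data <- read.csv(tmp_path)
local_data

# Remove the temporary file once we are done with it
unlink(tmp_path)
```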
3.3.2 Factors
Let’s use the str() function to get more information about our variable shrub_data.
str(shrub_data)
## 'data.frame': 10 obs. of 4 variables:
## $ shrubID: chr "a1" "a2" "b1" "b2" ...
## $ length : num 2.2 2.1 2.7 3 3.1 2.5 1.9 1.1 3.5 2.9
## $ width : num 1.3 2.2 1.5 4.5 3.1 2.8 1.8 0.5 2 2.7
## $ height : num 9.6 7.6 2.2 1.5 4 3 4.5 2.3 7.5 3.2
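Notice that the shrubID column has type chr (character). In versions of R before 4.0.0, read.csv() converted strings to factors by default, so this column would have been read as type Factor. A factor is a special data type (NOT data structure) in R for categorical data. Factors are useful for statistics, but can mess up some aspects of computation, as we’ll see in future chapters. To keep strings as characters regardless of the R version, set stringsAsFactors = FALSE: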
shrub_data <- read.csv('https://datacarpentry.org/semester-biology/data/shrub-dimensions-labeled.csv', stringsAsFactors = FALSE)
str(shrub_data)
## 'data.frame': 10 obs. of 4 variables:
## $ shrubID: chr "a1" "a2" "b1" "b2" ...
## $ length : num 2.2 2.1 2.7 3 3.1 2.5 1.9 1.1 3.5 2.9
## $ width : num 1.3 2.2 1.5 4.5 3.1 2.8 1.8 0.5 2 2.7
## $ height : num 9.6 7.6 2.2 1.5 4 3 4.5 2.3 7.5 3.2
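To make the idea of a factor more concrete, here is a small sketch that builds factors by hand with factor(). It also shows one common way factors can interfere with computation. The vectors site_factor and plot_ids are hypothetical and are not part of the shrub data.

```r
# A factor stores each distinct value once, as a "level"
site_factor <- factor(c("a", "a", "b", "c"))
levels(site_factor)        # "a" "b" "c"

# Internally, each element is an integer code pointing at a level
as.integer(site_factor)    # 1 1 2 3

# Pitfall: numbers stored as a factor convert to their codes, not their values
plot_ids <- factor(c("10", "20", "10"))
as.numeric(plot_ids)                  # 1 2 1
as.numeric(as.character(plot_ids))    # 10 20 10
```

Converting to character first, as in the last line, recovers the original numbers.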