7 Introduction to statistics

In this chapter, we will learn about how to use statistics to analyze data.

7.1 Inferential statistics

Inferential statistics is all about using a sample to make inferences about the population from which the sample was drawn. For example, we might want to check if the efficacy of the Pfizer vaccine is better than that of the Moderna vaccine for preventing COVID-19 (spoiler alert: it’s not!).

Figure 7.1: A visualization of inferential statistics.

In inferential statistics, there are two main types of errors: sampling bias and sampling error. Both describe how there are differences between the sample and the population.

Sampling bias: your sample may not be representative of the population.
Sampling error: the difference between a sample and population value (e.g., you measure the heights of a sample to be 160 cm when the population height is in fact 155 cm).

7.2 Measures of central tendency

Measures of central tendency. The three measures of central tendency are your mean, median, and mode.

Mean: the arithmetic average.
Median: a value that splits ordered data into two equal halves.
Mode: the most frequent value in a dataset.

A visualization of measures of central tendency. Figure retrieved from this [Medium Digest article](https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa?gi=c05c427e4ad9).

Figure 7.2: A visualization of measures of central tendency. Figure retrieved from this Medium Digest article.

7.3 Measures of spread

The main measures of spread are range, interquartile range, standard deviation, and variance. As you may already know, the range is the difference between the largest and smallest values in a dataset.

Interquartile range. the difference between the third and first quartiles in a dataset (IQR = Q3 - Q1). IQR also represents the middle 50% of your dataset.

How to compute interquartile range. Figure retrieved from [Boston University statistics notes](https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_summarizingdata/bs704_summarizingdata7.html).

Figure 7.3: How to compute interquartile range. Figure retrieved from Boston University statistics notes.

We often visualize our data with a boxplot, which relies on the concept of quartiles and IQR.

A boxplot to represent data. Figure adapted from [Naysan Saran's article](https://naysan.ca/2020/06/28/interquartile-range-iqr-to-detect-outliers/).

Figure 7.4: A boxplot to represent data. Figure adapted from Naysan Saran’s article.

Given the following numbers, compute the interquartile range: 45, 47, 52, 52, 53, 55, 56, 58, 62, 80.

A boxplot to represent data. Figure adapted from [BYJU's article](https://byjus.com/maths/interquartile-range/).

Figure 7.5: A boxplot to represent data. Figure adapted from BYJU’s article.

Standard deviation and variance. Both standard deviation (SD) and variance (Var) measure the spread of individual data points from the mean. A bigger SD or Var means a bigger spread. A smaller SD or Var means a smaller spread.

The mathematical definition of population SD ( $σ$ ) and sample SD ( $s$ ) is:

$\begin{aligned} σ & = \sqrt{\frac{\sum_{i = 1}^{n} (x_{i} - μ)^{2}}{n}} \\ s & = \sqrt{\frac{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}{n - 1}} \end{aligned}$

The mathematical definition of population variance ( $σ^{2}$ ) and sample variance ( $s^{2}$ ) is:

$\begin{aligned} σ^{2} & = \frac{\sum_{i = 1}^{n} (x_{i} - μ)^{2}}{n} \\ s^{2} & = \frac{\sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}{n - 1} \end{aligned}$

Remarks:

$x_{i}$ : individual data points.
$μ$ : population mean.
$\bar{x}$ : sample mean.
$n$ : number of datum in your dataset.
We divide by $n - 1$ for sample standard deviation because to account for the greater variation of sample data as opposed to population data.

Given the following 7 numbers, compute the standard deviation: 9, 4, 5, 4, 12, 7, 8.

$\begin{aligned} μ & = \frac{9 + 4 + 5 + 4 + 12 + 7 + 8}{7} = 7 \\ σ & = \sqrt{\frac{\sum (x_{i} - μ)^{2}}{n}} \\ = \sqrt{\frac{(9 - 7)^{2} + (4 - 7)^{2} + \dots}{7}} \\ \approx 2.73 \end{aligned}$

The final measure of spread we will cover is something called standard error.

Standard error. Standard error (SE or SEM) measures the amount of spread of sample means from their true population mean.

$S E = \frac{σ}{\sqrt{n}}$

We will use SE extensively when we cover statistical tests. Here is an excellent article on the different use cases between SD and SEM.