A Statistical Background

A.1 Basic statistical terms

A.1.1 Mean

The mean, also known as (AKA) the average, is the most commonly reported measure of center. It is commonly called the “average” though this term can be a little ambiguous. The mean is the sum of all of the data elements divided by how many elements there are. If we have \(n\) data points, the mean is given by: \[\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\]

Note that you will often see shorthand notation for a sum of numbers using \(\sum\) notation. For example, we could rewrite the formula for \(\bar{x}\) as:

\[\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}=\frac{\sum_{i = 1}^n x_i}{n}\]

We’ve simply replaced the subscript on each \(x\) with a generic index \(i\), and use the \(\sum\) notation to indicate we are summing \(x\)s with indices (i.e. subscripts) that go from 1 to \(n\). When summing numbers in statistics, we’re almost always dealing with indices that start with the value 1 and go up to a value equal to a sample size (e.g. \(n\)), so often you will see an even more shorthand version, where it’s assumed you’re summing from \(i = 1\) to \(i = n\):

\[\bar{x} = \frac{\sum x_i}{n}\]

A.1.2 Median

The median is calculated by first sorting a variable’s data from smallest to largest. After sorting the data, the middle element in the list is the median. If the middle falls between two values, then the median is the mean of those two values.

A.1.3 Standard deviation

We will next discuss the standard deviation of a sample dataset pertaining to one variable. The formula can be a little intimidating at first but it is important to remember that it is essentially a measure of how far to expect a given data value is from its mean:

\[Standard \, deviation = \sqrt{\frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2}{n - 1}} = \sqrt{\frac{\sum_{i= 1}^n(x_i - \overline{x})^2 }{n - 1}}\]

A.1.4 Five-number summary

The five-number summary consists of five values: minimum, first quartile AKA 25th percentile, second quartile AKA median AKA 50th percentile, third quartile AKA 75th, and maximum. The quartiles are calculated as

  • first quartile (\(Q_1\)): the median of the first half of the sorted data
  • third quartile (\(Q_3\)): the median of the second half of the sorted data

The interquartile range is defined as \(Q_3 - Q_1\) and is a measure of how spread out the middle 50% of values is. The five-number summary is not influenced by the presence of outliers in the ways that the mean and standard deviation are. It is, thus, recommended for skewed datasets.

A.1.5 Percentiles or rank

Just as you can sort the values of a variable form smallest to larges and find the middle element to get the median, you are dividing all the values into two groups with the same number of observations.

If you do that but instead of dividing into 2 groups, you divide it into 4 groups, you get each of the quartiles.

If you divide the sorted observations into 100 groups of equal number of observations, you get the percentiles. Depending on the application, the percentile can be read as the rank in the distribution.

A.1.6 Distribution

The distribution of a variable/dataset corresponds to generalizing patterns in the dataset. It often shows how frequently elements in the dataset appear. It shows how the data varies and gives some information about where a typical element in the data might fall. Distributions are most easily seen through data visualization.

A.1.7 Outliers

Outliers correspond to values in the dataset that fall far outside the range of “ordinary” values. In regards to a boxplot (by default), they correspond to values below \(Q_1 - (1.5 * IQR)\) or above \(Q_3 + (1.5 * IQR)\).

Note that these terms (aside from Distribution) only apply to quantitative variables.