ideas_about_data.qmd

# Ideas about data

```{r}
#| results: "asis"
#| echo: false

source("_common.R")
status("polishing")
```

This chapter covers some important concepts data. Data is made up of:

-   properties we have measured or recorded, known as variables, and

-   observations, the individual things with those properties.

Data is most commonly (and helpfully) organised with variables in columns and each observation on a row.

We can classify a variable in two main ways: by what role the variable takes in analysis and by what kinds of value it can take and how frequently its possible values occur. The type of value and the likelihood of values appearing is known as a variable's *distribution*. Both of these determine how we summarise, plot and analyse data.

## Roles of variables in analysis

When we do research, we typically have variables that we choose or set and variables that we measure. The variables we choose or set are called independent or explanatory variables. The variables we measure are called dependent or response variables. For example, we might measure the concentration of enzymes and hormones in blood samples of individuals with different genotypes. The genotype acts as the explanatory variable and the blood measurements are response. We would be interested in whether the blood measures differ between genotypes.

## Distributions

Any variable has a *distribution* which describes the values the variable can take and the chance of them occurring. The distribution is a function (relationship) which captures how the values are mapped to the probability of them occurring. A distribution has a general type, given by the function and is further tuned by the *parameters* in the function. For example variables like height, length, concentration, mass and intensity follow a *normal distribution* also know as the bell-shaped curve so that they look similar. However, they have different means and standard deviations. The mean and the standard deviation are the parameters of the normal distribution.

An important distinction is between discrete and continuous types of data. Continuous variables are measurements that can take any value in their range (@fig-distrib-contin). Discrete variables can take only specific values (@fig-distrib-discrete).

```{r}
#| echo: false
#| label: fig-distrib-contin
#| fig-cap: "A continuous variable has a smooth distribution and the probability of getting a particular value or less is given by the area under the curve. The example is a normal distribution."
mean <- 4.2
sd <-  1.2

# function for shading under a normal distribution curve
dnorm_limit <- function(x) {
  y <- dnorm(x, mean, sd)
  y[x < 5 |  x > 10] <- NA
  return(y)
}

ggplot(data = data.frame(x = c(0, 9)), aes(x)) +
  stat_function(fun = dnorm, 
                n = 101, 
                args = list(mean, sd)) + 
  stat_function(fun = dnorm_limit, 
                geom = "area", 
                fill = pal2[1]) +
  annotate("text", size = 5,
           x = 1, y = 0.22,
           label = "Function") +
  annotate("segment",
           x = 1, y = 0.2,
           xend = 2.5, yend = 0.125,
           arrow = arrow(type = "closed", 
                         length = unit(0.03, "npc"))) +
  annotate("text", size = 5,
           x = 6.7, y = 0.26,
           label = "Shaded area under curve\nis the probability of getting\na value of X or more") +
  annotate("segment",
           x = 6.5, y = 0.22,
           xend = 5.5, yend = 0.1,
           arrow = arrow(type = "closed", 
                         length = unit(0.03, "npc"))) +
  scale_y_continuous(breaks = NULL, 
                     name = "",
                     expand = c(0, 0)) +
  scale_x_continuous(name = "Value a continuous variable can take",
                     expand = c(0, 0),
                     breaks = 5, labels = "X") +
  
  theme_classic() +
  theme(axis.text.x = element_text(size = 20))

```

```{r}
#| echo: false
#| label: fig-distrib-discrete
#| fig-cap: "A discrete variable has a stepped distribution and the probability of getting exactly a particular is given by the area of the bar at that value. The example is a Poisson distribtion."
mean <- 2


ggplot(data.frame(x = c(0:6)), aes(factor(x), fill = factor(x))) +
  geom_col(aes(y = dpois(x, mean)), colour = "black" ) +
  annotate("text", size = 5,
           x = 4.5, y = 0.26,
           label = "Probability of getting\na value of exactly Y") +
  annotate("segment",
           x = 4.4, y = 0.23,
           xend = 4, yend = 0.15,
           arrow = arrow(type = "closed", 
                         length = unit(0.03, "npc"))) +
  scale_y_continuous(breaks = NULL, 
                     name = "",
                     expand = c(0, 0),
                     limits = c(0, 0.3)) +
  scale_x_discrete(name = "Value a discrete variable can take",
                  breaks = 3, labels = "Y") +
  scale_fill_manual(values = c("white",
                               "white",
                               "white",
                               pal2[1],
                               "white",
                               "white",
                               "white")) +
  theme_classic() +
  theme(legend.position = "none",
        axis.text.x = element_text(size = 20))

```
## Summarising data

We summarise data by giving a measure of *central tendency* and a measure of *dispersion*.  A measure of central tendency is a single value that represents the middle or centre of a variable's distribution. The three most common measures are:

-   the mean, or more accurately, the arithmetic mean, $\bar{x} = \frac{\sum{x}}{n}$

-   the median: the middle value for a variable with values arranged in order of magnitude

-   the mode: the most frequent value in the variable.

Measures of dispersion describe the spread of values around the centre and indicate how much variability there is in the variable. The most common measures of dispersion are:

-   the range: the difference between the maximum value and the minimum value in a variable

-   the interquartile range: two values, the first quartile and the thrid quartile. The first quartile is half way between the median value and the lowest value when the values are arranged in order and the third quartile is halfway between the median value and the highest value 

-   the variance: the average of the squared differences between each value and the variable's mean, $s^2 = \frac{(\sum{x - \bar{x})^2}}{n - 1}$

-   the standard deviation: the square root of the variance.


## Discrete data

Discrete variables can take only specific values and can be categories, like genotype, or counts, like the number of petals.

### Categories: Nominal and Ordinal

Categorical data can be nominal and ordinal depending on whether they are ordered. Nominal variable have no particular order, for example, the eye colour of Drosophila or the species of bird. When summarising data on bird species, it wouldn't matter in what order the information was given or plotted. Ordinal variables have an order. The Likert scale @likert1932 used in questionnaires is one example. The possible responses are Strongly agree, Agree, Disagree and Strongly disagree; these have an order that you would use when plotting them (@fig-categories).

```{r}
#| echo: false
#| label: fig-categories
#| fig-cap: "The categories of bird species are unordered - or nominal - but those in a likert variable are ordinal."

birds <- data.frame(Response = c("Robin",
                                 "Chaffinch",
                                 "Thrush",
                                 "Wren"),
                    Frequency = c(12, 6, 10, 13))

a <- birds |> ggplot(aes(x = Response, y = Frequency)) +
  geom_col() +
  scale_x_discrete(name = "Species") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0 , 20)) +
  theme_classic() +
  theme(legend.position = "none")


likert <- data.frame(Response = factor(c("Strongly agree",
                                         "Agree",
                                         "Disagree",
                                         "Strongly disagree"), 
                                       levels = c("Strongly agree",
                                                  "Agree",
                                                  "Disagree",
                                                  "Strongly disagree")),
                     Frequency = c(15, 18, 10, 2))

b <- likert |> ggplot(aes(x = Response, y = Frequency)) +
  geom_col() +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0 , 20)) +
  theme_classic() +
  theme(legend.position = "none")

a + b
```

The most appropriate way to summarise nominal or ordinal data is to report the mode (most frequent value) or tabulate the number of each value (@tbl-likert).

```{r}
#| echo: false
#| label: tbl-likert
#| tbl-cap: "Frequency of reposnes."

knitr::kable(likert) |> kableExtra::kable_styling()

```

### Counts

Counts are one of the most common data types. They are quantitative but discrete because they can take only specific values. Counts tend to follow a distribution called the Poisson distribution meaning that there is an expected number of zeros, ones and twos etc that appear. This is true no matter what you count. The distribution of counts is not symmetrical and has a tail to the right - especially for lower count ranges.

A mean and standard deviation are not usually the best way to summarise a count variable because their distribution is not symmetrical. @fig-counts shows the number of groundsel plants in a hundred 1 metre square quadrats[^ideas_about_data-1]. 


```{r}
#| echo: false
#| label: fig-counts
#| fig-cap: "Counts are discrete. They often follow a Poisson distribution."
set.seed(5)
nquad <- 100
mu <- 0.7
plants <- data.frame(nplants = rpois(nquad, mu)) 
sample_m <- mean(plants$nplants)
sample_median <- median(plants$nplants)

plants |> ggplot(aes(x = factor(nplants))) +
  geom_bar() +
  geom_vline(xintercept = sample_m + 1,
             colour = pal4[1], 
             linewidth = 2) +
  geom_vline(xintercept = sample_median + 1,
             colour = pal4[2], 
             linewidth = 2) +
  annotate("text", x = 2.1, y = 55,
           label = paste("mean = ", sample_m),
           colour = pal4[1]) +
    annotate("text", x = 1.3, y = 55,
           label = paste("median = ", sample_median),
           colour = pal4[2]) +
  scale_x_discrete(name = "Number of plants per quadrat") +
  scale_y_continuous(expand = c(0, 0),
                     limits = c(0 , 60),
                     name = "Number of quadrats") +
  theme_classic() +
  theme(legend.position = "none")

```

The mean of `r sample_m` groundsel plants per quadrat does not really reflect that the majority of quadrats contained no plants. The average is being dragged up by the few quadrats containing 3 plants.

For count variables - and other variables that are not symmetrical - it is often better to give the median and interquartile range.

The median is the middle value when the values are arranged in order. The interquartile range (IQR) indicates the spread in the data. The lower quartile is half way between the lowest value and the middle value and the upper quartile is half way between the middle value and the highest value.

## Continuous data

Continuous variables are measurements that can take *any* value in their range so there are an infinite number of possible values. The values have decimal places. Variables like the length and mass of an organism, the volume and optical density of a solution, or the colour intensity of an image are continuous. Many response variables are continuous but continuous variables can also be explanatory. A large number of continuous variables follow the normal distribution.

### The normal distribution

For example, a variable like human height has values with decimal places which follow a normal distribution also known as the Gaussian distribution or the bell-shaped curve. Values of 1.65 metres occur more often than values of 2 metres and values of 3 metres never occur (@fig-normal-distribtion).

```{r}
#| echo: false
#| label: fig-normal-distribtion
#| fig-cap: "Human height follows a normal distribution."

m <- 1.65
sd <- 0.06
ggplot(data = data.frame(Height = c(m - 3 * sd, m + 3 * sd)), aes(Height)) +
  stat_function(fun = dnorm, n = 101, 
                args = list(mean = m, sd = sd)) +
  scale_y_continuous(breaks = NULL, name = "Likelihood of value occuring",
                     expand = c(0, 0)) +
  annotate("text", x = 1.5, y = 4.5,
           label = "Values are rare") +
    annotate("text", x = 1.65, y = 4.5,
           label = "Values are common") +
    annotate("text", x = 1.8, y = 4.5,
           label = "Values are rare") +
  theme_classic()
```

The mean and standard deviation are the best way to summarise a normally distributed variable.

## Theory and practice

The distinction between continuous and discrete values is clear in theory but in practice, the actual values you have may differ from the expected distribution for a particular variable. For example, we would expect the mass of cats to be continuous because any value in the range is possible. However, if our scales only measure to the nearest kilogram, then the values become discrete because the only the values possible are 0, 1, 2, 3, 4, 5, 6 and 7. The gap between values, 1kg, is big compared to the range of values we expect (@fig-contin-is-discrete)

```{r}
#| echo: false
#| label: fig-contin-is-discrete
#| fig-cap: "The mass of cats in kilograms is theoretically continuous (left). If we measure only to the nearest kilogram, mass becomes discrete because 1kg covers a large chunk of the range (right)."

m <- 4
sd <- 0.8
set.seed(1234)
a <- ggplot(data = data.frame(Mass = c(m - 3 * sd, m + 3 * sd)), aes(Mass)) +
  stat_function(fun = dnorm, n = 101, 
                args = list(mean = m, sd = sd)) + 
  scale_y_continuous(breaks = NULL, name = "", 
                     expand = c(0, 0)) +
  annotate("text", x = m - 2 * sd, y = 0.4,
           label = "Theory") +
  theme_classic()

b <- ggplot(data = data.frame(Mass = round(rnorm(1000, m, sd), 0)), aes(Mass)) +
  geom_histogram(binwidth = 1, colour = "black", fill = "white") +
  scale_y_continuous(breaks = NULL, name = "",
                     expand = c(0, 0)) +
  annotate("text", x = m - 2.5 * sd, y = 400,
           label = "Practice when\nmeasured to\nthe nearest 1kg") +
  theme_classic()

a + b
```

In contrast, the number of hairs on a human head ranges from about 80,000 to 150,000 so the difference between having 100,203 hairs and 100,204 hairs is so tiny the variable is practically continuous (@fig-discrete-is-contin).

```{r}
#| echo: false
#| label: fig-discrete-is-contin
#| fig-cap: "The number of hairs on a human head discrete. But one increment is a tiny proportion of the range of values so the number of hairs is practically continuous."


m <- 120000
sd <- 20000
set.seed(12)
ggplot() +
  geom_histogram(data = data.frame(hairs = round(rnorm(60000, m, sd), 0)),
                 aes(hairs),
                 bins = 120, colour = "black", fill = "white") +
  scale_y_continuous(breaks = NULL, name = "",
                     expand = c(0, 0)) +
  scale_x_continuous("Number of hairs on head") +
  theme_classic()


```

A rule of thumb is that if the mean count is above about 100, then the distribution of counts closely matches the normal distribution.


## Summary

1.  Data: Data consists of variables (properties measured or recorded) and observations (individual things with those properties). Data is commonly organised with variables in columns and observations in rows.

2.  Roles of Variables in Analysis: Variables in research can be independent or explanatory variables (chosen or set by researchers) and dependent or response variables (measured by researchers).

3.  Distributions: A variable's distribution describes the types of values it can take and the likelihood of each value occurring. Variables can be classified as discrete (specific values) or continuous (any value within a range). The shape of a distribution is determined by parameters.

4.  Discrete Data: Discrete variables can be categories or counts. Categorical data can be nominal (no order) or ordinal (ordered) and are best summarised with the mode or a table. Counts are often best summarised by the median and interquartile range.

5.  Continuous Data: Continuous variables can take any value within a range. The normal distribution is the most common continuous distribution and is best summarised with the mean and standard deviation.

6.  Theory and Practice: While the theoretical distinction between continuous and discrete values is clear, the actual values obtained in practice may deviate from the expected distribution. Measurement precision and range affect whether a variable is considered continuous or discrete. A rule of thumb suggests that if the mean count is above approximately 100, the distribution of counts approximates a normal distribution.


[^ideas_about_data-1]: A quadrat is a frame used in ecology to outline a standard unit of area for study of the distribution of plant species over a wider are. Typically, quadrats are placed randomly over the area and the number of individuals in each quadrat is recorded.