-
Notifications
You must be signed in to change notification settings - Fork 16
/
data-terminology.qmd
267 lines (192 loc) · 20.2 KB
/
data-terminology.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
# Disease variables
## Disease quantification
Studies on the temporal progression or spatial spread of epidemics cannot be conducted without field-collected data, or, in some cases, simulated data. The study of plant disease quantification, termed Phytopathometry, is a subdivision of plant pathology concerned with the science of disease measurement. It has strong ties to the field of epidemiology [@bock2021].
Traditionally, disease quantification has been executed through visual evaluation. However, the past few decades have witnessed significant advancements in imaging and remote sensing technologies (which don't necessitate contact with the object), leaving a profound impact on this field. As such, disease quantity can now be gauged through estimation (visually, by the human eye) or measurement (through remote sensing technologies such as RGB, MSI, and HSI) [@fig-disease_measure].
While the utilization of digital or remote sensing technology for disease measurement or estimation provides a more objective approach, visual assessment is largely subjective. It is known to vary among human raters, as these raters differ in their innate abilities, training, and how they are influenced by the chosen method (e.g., scales). Disease is estimated or measured on a specimen within a population, or on a sample of specimens drawn from that population. The specimen in question can be a plant organ, an individual plant, a group of plants, a field, or a farm. The specific specimen type also determines the terminology used to describe disease quantity.
![Different approaches used to obtain estimates or measures of plant disease. RGB = red, green, blue; MSI = multispectral imaging; HSI = hyperspectral imaging.](imgs/disease_measure.png){#fig-disease_measure style="margin: 10px" fig-align="center" width="490"}
Finally, while developing new or refining existing disease assessment methods, it is crucial to evaluate the reliability of the assessments made by different raters or instruments, as well as their accuracy---specifically, how close the estimations or measurements are to the reference (or gold standard) values. Several methods are available for assessing the reliability, precision, and accuracy of these estimates or measurements (see [definitions](data-accuracy.html)). The choice of methods depends on the objective of the work, but largely on the type or nature of the data. These considerations will be further discussed.
## Disease variables
A common term used to reference the quantity of disease, irrespective of how it is expressed, is 'disease intensity'. This term, however, has minimal practical value as it only implies that the disease is more or less "intense". We require more specific terminology to standardize the reference to disease quantity and methodology. One of the primary tasks in disease assessment is classifying each specimen, often in a sample or within a population, as diseased or not diseased. This binary (yes/no or 1/0) evaluation may sufficiently express disease intensity if the goal is to ascertain the number or proportion of diseased specimens in a sample or a population.
This discussion brings us to two terms: disease incidence and prevalence. Incidence is typically used to denote the proportion or number (count) of plants (or their organs) deemed as observational units at the field scale or below. On the other hand, prevalence refers to the proportion or number of fields or farms with diseased plants within a larger production area or region [@nutter2006] @fig-prevalence_incidence. Therefore, prevalence is analogous to incidence, with the only difference being the spatial scale of the sampling unit.
![Schematic representation of how prevalence and incidence of plant diseases are calculated depending on the spatial scale of the assessment](imgs/prevalence_incidence.png){#fig-prevalence_incidence style="margin: 15px" fig-align="center" width="480"}
In many instances, it's necessary to determine the degree to which a specimen is diseased, a concept defined as disease severity. In certain contexts, severity is narrowly defined as the proportion of the unit that exhibits symptoms [@nutter2006]. However, a more expansive view of severity includes additional metrics such as nominal or ordinal scores, lesion count, and percent area affected (ratio scale). Ordinal scales are broken down into rank-ordered classes (see specific [section](data-ordinal.html)), defined based on either a percentage scale or descriptions of symptoms [@bock2021]. Occasionally, disease is expressed in terms of (average) lesion size or area, which could be regarded as a measure of severity. These variables represent different levels of measurements that provide varying degrees of information about the disease quantity - from low (nominal scale) to high (ratio scale) [@fig-severity].
![Scales and associated levels of measurement used to describe severity of plant diseases](imgs/severity.png){#fig-severity fig-align="center" style="margin: 15px" width="511"}
## Data types
The data used to express disease as incidence or any form of severity measurements can be discrete or continuous in nature.
Discrete variables are countable (involving integers) at a particular point in time. In other words, only a finite number of values (nominal or ordinal) is possible, and these cannot be subdivided. For instance, a plant or plant part can be either diseased or not diseased (nominal data). It's not possible to count 1.5 diseased plants. Furthermore, a plant classified as diseased may exhibit a certain number of lesions (count data), or be categorized into a specific severity class (ordinal data, common in ordinal scales, e.g., 1-9). Disease data in the form of counts often relates to the number of infections per sampling units. Most commonly, these counts refer to the assessed pathogen population, such as the number of airborne or soilborne propagules.
In contrast to discrete variables, continuous variables can be measured on a scale and can assume any numeric value between two points. For example, the size of a lesion on a plant can be measured at a very precise scale (cm or mm). An estimate of severity on a percent scale (% diseased area) can take any value between non-zero and 100%. Although incidence at the individual level is discrete, at the sample level it can be treated as continuous, as it can assume any value in proportion or percentage.
Disease variables can also be characterized by a statistical distribution, which are models that provide the probability of a specific value (or a range of values) being drawn from a particular distribution. Understanding statistical or mathematical distributions is a crucial step in improving our grasp of data collection methods, experiment design, and data analysis processes such as data summarization or hypothesis testing.
## Statistical distributions and simulation
### Binomial distribution
For incidence (and prevalence), the data is binary at the individual level, as there are only two possible outcomes in a *trial*: the plant or plant part is disease or not diseased. The statistical distribution that best describe the incidence data at the individual level is the *binomial distribution*.
Let's simulate the binomial outcomes for a range of probabilities in a sample of 100 units, using the `rbinom()` function in R. For a single trial (e.g., status of plants in a single plant row), the `size` argument is set to 1.
```{r}
#| warning: false
#| message: false
library(tidyverse)
library(r4pde)
set.seed(123) # for reproducibility
P.1 <- rbinom(100, size = 1, prob = 0.1)
P.3 <- rbinom(100, size = 1, prob = 0.3)
P.7 <- rbinom(100, size = 1, prob = 0.7)
P.9 <- rbinom(100, size = 1, prob = 0.9)
binomial_data <- data.frame(P.1, P.3, P.7, P.9)
```
We can then visualize the plots.
```{r}
#| label: fig-binomial
#| fig-cap: "Binomial distribution to describe binary data"
binomial_data |>
pivot_longer(1:4, names_to = "P",
values_to = "value") |>
ggplot(aes(value)) +
geom_histogram(fill = "#339966",
bins = 10) +
facet_wrap( ~ P) +
theme_r4pde()
```
### Beta distribution
In many studies, it's often useful to express these quantities as a proportion of the total population or sample size, rather than absolute numbers. This helps standardize the data, making it easier to compare between different populations or different time periods.
For example, if we're studying a plant disease, we could express disease incidence as the proportion of plants that are newly diseased during a given time period. Similarly, disease severity could be expressed as the proportion of each plant's organ area that is affected by the disease. These proportions are ratio variables, as they can take on any value between 0 and 1, and ratios of these variables are meaningful.
The Beta distribution is a probability distribution that is defined between 0 and 1, which makes it ideal for modeling data that represents proportions. It's a flexible distribution, as its shape can take many forms depending on the values of its two parameters, often denoted as alpha and beta.
Let's simulate some data using the `rbeta()` function.
```{r}
beta1.5 <- rbeta(n = 1000, shape1 = 1, shape2 = 5)
beta5.5 <- rbeta(n = 1000, shape1 = 5, shape2 = 5)
beta_data <- data.frame(beta1.5, beta5.5)
```
Notice that there are two shape parameters in the beta distribution: `shape1` and `shape2` to be defined. This makes the distribution very flexible and with different potential shapes as we can see below.
```{r}
#| warning: false
#| label: fig-betabin
#| fig-cap: "Binomial distribution to describe proportion data"
beta_data |>
pivot_longer(1:2, names_to = "P",
values_to = "value") |>
ggplot(aes(value)) +
geom_histogram(fill = "#339966",
color = "white",
bins = 15) +
scale_x_continuous(limits = c(0, 1)) +
facet_wrap( ~ P)+
theme_r4pde()
```
### Beta-binomial distribution
The Beta-Binomial distribution is a mixture of the Binomial distribution with the Beta distribution acting as a prior on the probability parameter of the binomial. Disease probabilities can vary across trials due to a number of unobserved or unmeasured factors. This variability can result in overdispersion, a phenomenon where the observed variance in the data is greater than what the binomial distribution expects.
This is where the Beta-Binomial distribution comes in handy. By combining the Beta distribution's flexibility in modeling probabilities with the Binomial distribution's discrete event modeling, it provides an extra layer of variability to account for overdispersion. The Beta-Binomial distribution treats the probability of success (disease occurrence in this context) as a random variable itself, following a Beta distribution. This means the probability can vary from trial to trial.
Therefore, when we observe data that shows more variance than the Beta distribution can account for, or when we believe there are underlying factors causing variability in the probability of disease occurrence, the Beta-Binomial distribution is a more appropriate model. It captures both the variability in success probability as well as the occurrence of the discrete event (disease incidence).
When combined with the Binomial distribution, which handles discrete events (e.g. whether an individual is diseased or not), the Beta-Binomial distribution allows us to make probabilistic predictions about these events. For example, based on prior data (the Beta distribution), we can estimate the likelihood of a particular individual being diseased (the Binomial distribution).
In R, the `rBetaBin` function of the *FlexReg* package generates random values from the beta-binomial distribution. The arguments of the function are `n`, or the number of values to generate; if length(`n`) \> 1, the length is taken to be the number required. `size` is he total number of trials. `mu` is the mean parameter. It must lie in (0, 1). `theta` is the overdispersion parameter. It must lie in (0, 1). `phi` the precision parameter. It is an alternative way to specify the theta parameter. It must be a positive real value.
```{r}
#| warning: false
#| message: false
library(FlexReg)
betabin3.6 <- rBetaBin(n = 100, size = 40, mu = .3, theta = .6)
betabin7.3 <- rBetaBin(n = 100, size = 40, mu = .7, theta = .3)
betabin_data <- data.frame(betabin3.6, betabin7.3)
```
```{r}
#| warning: false
#| label: fig-beta
#| fig-cap: "Beta-binomial distribution to describe proportion data"
betabin_data |>
pivot_longer(1:2, names_to = "P",
values_to = "value") |>
ggplot(aes(value)) +
geom_histogram(fill = "#339966",
color = "white",
bins = 15) +
facet_wrap( ~ P) +
theme_r4pde()
```
### Poisson distribution
When conducting studies in epidemiology, specifically plant diseases, researchers often collect data on the number of diseased plants, infected plant parts, or individual symptoms, such as lesions. These variables are counted in whole numbers - 1, 2, 3, etc., making them discrete variables. Discrete variables contrast with continuous variables that can take any value within a defined range and can include fractions or decimals. In addition to being discrete, these variables are also non-negative, meaning they cannot take negative values. After all, you can't have a negative number of diseased plants or lesions. Given these characteristics, a suitable distribution to model such data is the Poisson distribution. This distribution is particularly suitable for counting the number of times an event occurs in a given time or space.
In R, we can used the `rpois()` function to obtain 100 random observations following a Poisson distribution. For such, we need to inform the number of observation (n = 100) and `lambda`, the vector of means.
```{r}
poisson5 <- rpois(100, lambda = 10)
poisson35 <- rpois(100, lambda = 35)
poisson_data <- data.frame(poisson5, poisson35)
```
```{r}
#| label: fig-poisson
#| fig-cap: "Poisson distribution to describe count data"
poisson_data |>
pivot_longer(1:2, names_to = "P",
values_to = "value") |>
ggplot(aes(value)) +
geom_histogram(fill = "#339966",
color = "white",
bins = 15) +
facet_wrap( ~ P) +
theme_r4pde()
```
### Negative binomial distribution
While the Poisson distribution is indeed suitable for modeling count data, it assumes that the mean and variance of the data are equal. However, in real-world scenarios, especially in epidemiology, it is common to encounter overdispersed data - where the variance is greater than the mean. This could occur, for instance, if there's greater variability in disease incidence across different plant populations than would be expected under the Poisson assumption.
In such cases, the Negative Binomial distribution is a better alternative. The Negative Binomial distribution is a discrete probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified (non-random) number of failures occurs.
One of the key features of the Negative Binomial distribution is its ability to handle overdispersion. Unlike the Poisson distribution, which has one parameter (lambda, representing the mean and variance), the Negative Binomial distribution has two parameters. One parameter is the mean, but the other (often denoted as 'size' or 'shape') governs the variance independently, allowing it to be larger than the mean if necessary. Thus, it provides greater flexibility than the Poisson distribution for modeling count data and can lead to more accurate results when overdispersion is present.
In R, we can use the `rnbinom()` function to generate random variates from a Negative Binomial distribution. This function requires the number of observations (`n`), the target for the number of successful trials (`size`), and the probability of each success (`prob`).
Here's an example:
```{r}
# Generate 100 random variables from a Negative Binomial distribution
negbin14.6 <- rnbinom(n = 100, size = 14, prob = 0.6)
negbin50.6 <- rnbinom(n = 100, size = 50, prob = 0.6)
negbin_data <- data.frame(negbin14.6, negbin50.6)
```
```{r}
#| label: fig-negbin
#| fig-cap: "Negative binomial distribution to describe overdispersed count data"
negbin_data |>
pivot_longer(1:2, names_to = "P",
values_to = "value") |>
ggplot(aes(value)) +
geom_histogram(fill = "#339966",
color = "white", bins = 15) +
facet_wrap( ~ P) +
theme_r4pde()
```
### Gamma distribution
In plant disease epidemiology and other fields of study, we may often encounter continuous variables - these are variables that can take on any value within a given range, including both whole numbers and fractions. An example of a continuous variable in this context is lesion size, which can theoretically be any non-negative value.
Often, researchers use the normal (Gaussian) distribution to model such continuous variables. The normal distribution is symmetric, bell-shaped, and is fully described by its mean and standard deviation. However, a fundamental characteristic of the normal distribution is that it extends from negative infinity to positive infinity. While this is not a problem for many applications, it becomes an issue when the variable being modeled cannot take on negative values - like the size of a lesion.
This is where the Gamma distribution can be a good alternative. The Gamma distribution is a two-parameter family of continuous probability distributions, which does not include negative values, making it an appropriate choice for modeling variables like lesion sizes. While it might seem a bit more complicated due to its two parameters, this also allows it a greater flexibility in terms of the variety of shapes and behaviors it can describe. The Gamma distribution is often used in various scientific disciplines, including queuing models, climatology, financial services, and of course, epidemiology. Its main parameters are the shape and scale (or alternatively shape and rate), which control the shape, spread and location of the distribution.
We can use the `rgamma()` function that requires the number of samples (n = 100 in our case) and the `shape`, or the mean value.
```{r}
gamma10 <- rgamma(n = 100, shape = 10, scale = 1)
gamma35 <- rgamma(n = 100, shape = 35, scale = 1)
gamma_data <- data.frame(gamma10, gamma35)
```
```{r}
#| label: fig-gamma
#| fig-cap: "Gamma distribution to describe continuous data"
gamma_data |>
pivot_longer(1:2, names_to = "P",
values_to = "value") |>
ggplot(aes(value)) +
geom_histogram(fill = "#339966",
color = "white",
bins = 15) +
ylim(0, max(gamma_data$gamma35)) +
facet_wrap( ~ P) +
theme_r4pde()
```
### Simulating ordinal data
Ordinal data is a statistical data type consisting of numerical scores that fall into a set of categories which are ordered in a meaningful way. This can include survey responses (e.g., strongly disagree to strongly agree), levels of achievement (e.g., poor, average, good, excellent), or, in the case of plant disease, disease severity scales (e.g., 0 to 5, where 0 represents a healthy plant and 5 represents a plant with severe symptoms).
When working with ordinal data, we often need to make assumptions about the distribution of the data. However, unlike continuous data which might be modeled by a normal or Gamma distribution, or count data which might be modeled by a Poisson distribution, ordinal data is discrete and has a clear order but the distances between the categories are not necessarily equal or known. This makes the modeling process slightly different.
We can use the `sample()` function and define the probability associated with each rank. Let's generate 30 units with a distinct ordinal score. In the first situation, the higher probabilities (0.5) are for scores 4 and 5 and lower (0.1) for scores 0 and 1, and in the second situation is the converse.
```{r}
ordinal1 <- sample(0:5, 30, replace = TRUE, prob = c(0.1, 0.1, 0.2, 0.2, 0.5, 0.5))
ordinal2 <- sample(0:5, 30, replace = TRUE, prob = c(0.5, 0.5, 0.2, 0.2, 0.1, 0.1))
ordinal_data <- data.frame(ordinal1, ordinal2)
```
```{r}
#| label: fig-ordinal
#| fig-cap: "Sampling of ordinal data"
ordinal_data |>
pivot_longer(1:2, names_to = "P",
values_to = "value") |>
ggplot(aes(value)) +
geom_histogram(fill = "#339966",
color = "white",
bins = 6) +
facet_wrap( ~ P) +
theme_r4pde()
```