---
title: "Fisher's Exact Test"
subtitle: "Explaining Fisher's Exact Test, how to run it in R, and its interpretation."
title-block-banner: true
author:
- name: Tendai Gwanzura
- name: Ana Bravo
format:
  html:
    code-fold: true
    html-math-method: katex
    toc: true
editor: visual
theme: zephyr
bibliography: export-data.bib
---
## Introduction
- Fisher's exact test is a test of independence used to determine whether there is a relationship between two categorical variables, particularly when the sample size is small.
- It assesses whether the proportions of one variable differ across the levels of another variable.
- It uses the hypergeometric distribution, conditional on the table margins, to compute exact p-values rather than approximate ones; the resulting test is somewhat conservative.
- The Chi-square approximation is unreliable when the expected frequency count is \<5 in more than 20% of the cells of a contingency table (Bower 2003).
- The data are conveniently summarized in a contingency table.
### Assumptions
1. Assumes that the individual observations are independent.
2. Assumes that the row and column totals are fixed or conditioned.
3. The variables are categorical and randomly sampled.
4. Observations are count data.
### Hypotheses
The hypotheses of Fisher's exact test are the same as those of the Chi-square test:

Null hypothesis $(H_0)$: There is no relationship between the categorical variables; the variables are independent.

Alternative hypothesis $(H_1)$: There is a relationship between the categorical variables; the variables are dependent.
## Fisher's Exact Test Equation
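For a 2x2 table with fixed row and column totals, the probability of observing a particular combination of cell frequencies is calculated using the following formula. The one-tailed p-value is the sum of these probabilities over the observed table and all more extreme tables with the same totals.
$$ p = {(a+b)!(c+d)!(a+c)!(b+d)! \over a! b! c! d! n!} $$
- n = a + b + c + d = population size / total frequency
- a + b = number of "successes" in the population (first row total)
- a + c = sample size / draws from the population (first column total)
- a = sample successes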
### Formula description
The test can be run as a one-tailed or a two-tailed test (`fisher.test()` in R is two-sided by default). Here $a$, $b$, $c$, and $d$ are the individual cell frequencies of the 2x2 contingency table and $n$ is the total frequency. The formula gives the exact probability of the particular combination of frequencies that was actually observed.
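
As a minimal sketch of the formula above, the chunk below evaluates it for a hypothetical 2x2 table. The counts and the `cell_*` names are invented for illustration and are not taken from the GMP17 data. It compares the factorial formula with base R's `dhyper()`, then sums the probabilities of the observed and more extreme tables to obtain a one-tailed p-value that should agree with `fisher.test()`.

```{r}
# Hypothetical 2x2 cell counts (illustration only, not from the GMP17 data)
cell_a <- 3; cell_b <- 1
cell_c <- 1; cell_d <- 3
n <- cell_a + cell_b + cell_c + cell_d

# Probability of exactly this table, from the factorial formula
p_formula <- (factorial(cell_a + cell_b) * factorial(cell_c + cell_d) *
                factorial(cell_a + cell_c) * factorial(cell_b + cell_d)) /
  (factorial(cell_a) * factorial(cell_b) * factorial(cell_c) *
     factorial(cell_d) * factorial(n))

# The same probability from the hypergeometric density:
# m = "successes" in the population, n = "failures", k = number of draws
p_dhyper <- dhyper(cell_a, m = cell_a + cell_b, n = cell_c + cell_d,
                   k = cell_a + cell_c)

# One-tailed p-value: observed table plus all tables with a larger value of a
max_a <- min(cell_a + cell_b, cell_a + cell_c)
p_one_tailed <- sum(dhyper(cell_a:max_a, m = cell_a + cell_b,
                           n = cell_c + cell_d, k = cell_a + cell_c))

# Cross-check against fisher.test() with a one-sided alternative
toy_matrix <- matrix(c(cell_a, cell_c, cell_b, cell_d), nrow = 2)
p_fisher <- fisher.test(toy_matrix, alternative = "greater")$p.value

print(c(formula = p_formula, dhyper = p_dhyper,
        one_tailed = p_one_tailed, fisher = p_fisher))
```

The first two values should match each other, and so should the last two, which shows that the formula gives the probability of a single table while the p-value accumulates several such probabilities.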
### What is a contingency table?
A contingency table shows the joint frequency distribution of two categorical variables, with the levels of one variable in the rows and the levels of the other in the columns. When each variable has two levels it is referred to as a 2x2 table. Contingency tables are useful for summarizing categorical variables. The `table()` function is used to create a contingency table in R. Once the variables of interest are summarized in a contingency table, it is easier to run Fisher's exact test.
#### Example: Creating a contingency table
Let's say we have information on the gender of participants in a clinical trial and the type of drug administered to them. We can create the following contingency table for further analysis.
```{r, contingency-table}
# Example R code to create a contingency table
# Creating a data frame
df = data.frame (
"Drug" = c("Drug A", "Drug B", "Drug A"),
"Gender" = c("Male", "Male", "Female")
)
# Creating contingency table using table()
ctable = table(df)
print(ctable)
```
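
Because Fisher's exact test conditions on the row and column totals, it can help to display them next to the cell counts. The sketch below rebuilds the same small illustrative data frame and uses base R's `addmargins()` and `prop.table()`; the object names `drug_df`, `drug_table`, `with_totals`, and `row_props` are chosen here for illustration only.

```{r}
# Rebuild the small illustrative data frame from the example above
drug_df <- data.frame(
  Drug = c("Drug A", "Drug B", "Drug A"),
  Gender = c("Male", "Male", "Female")
)
drug_table <- table(drug_df)

# Append row and column totals (the margins the test conditions on)
with_totals <- addmargins(drug_table)
print(with_totals)

# Row-wise proportions: the share of each gender within each drug
row_props <- prop.table(drug_table, margin = 1)
print(row_props)
```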
### Performing Fisher's Exact Test in R
We will need to install the ggstatsplot package to visualize the statistical results, along with the other packages used in this example.
```{r, install }
#| echo: true
#install.packages("ggstatsplot")
#install.packages("summarytools")
#install.packages("gmodels")
#install.packages("gt")
#install.packages("gtsummary")
#install.packages("katex")
#install.packages("tidyverse")
```
### Data Source: GMP2017
For this example we will use the Greater Manchester Police's UK stop and search data from December 2017, sourced from the Sage Research Methods Dataset Part 2. The data contain information on stop and search events, gender, and ethnicity. We would like to assess whether there is a significant relationship between gender and stop and search events.
```{r, load-data}
#| echo: true
GMP17 <- read.csv("dataset-gmss-2017-subset1.csv")
```
### Load in libraries
```{r, load-libraries}
#| echo: true
#| message: false
library(gmodels)
library(ggstatsplot)
library(gt)
library(gtsummary)
library(katex)
library(tidyverse)
```
### Descriptive summary
```{r, data-summary}
#| code-fold: false
head(GMP17)
str(GMP17)
# determining the number of rows
NROW(GMP17)
```
### Assessing frequencies to answer research question
For this analysis we will use the Gender variable and the ObjectSearch variable
```{r}
#| echo: true
# Dropping the Ethnicity variable (column 2) to keep only the variables of interest for the 2x2 table
newGMP17 <- GMP17[-c(2)]
head(newGMP17)
```
The data contain missing values coded as -9, which we need to drop. We also need to relabel the values of our variables based on the data dictionary provided at <https://methods.sagepub.com/dataset/fishers-exact-gmss-2017-r>.
```{r}
# Exclude rows that have missing data (coded as -9) in either variable
newGMP17_nom <- subset(newGMP17, Gender > 0)
newGMP17_nom2 <- subset(newGMP17_nom, ObjectSearch > 0)
summary(newGMP17_nom2)
nrow(newGMP17_nom2)
```
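
To make the filtering step concrete, here is a small sketch on a made-up data frame (`toy_codes` and its values are hypothetical, chosen only to mimic the -9 missing-value code). It applies both conditions in a single `subset()` call and counts how many rows are removed.

```{r}
# Hypothetical coded data: -9 marks a missing value
toy_codes <- data.frame(
  Gender = c(1, 2, -9, 1, 2),
  ObjectSearch = c(1, -9, 2, 2, 1)
)

# Keep only rows where both variables have a valid (positive) code
toy_complete <- subset(toy_codes, Gender > 0 & ObjectSearch > 0)

# Number of rows dropped because of missing codes
rows_dropped <- nrow(toy_codes) - nrow(toy_complete)
print(toy_complete)
print(rows_dropped)
```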
```{r}
# Recoding the Gender levels based on the data dictionary
newGMP17_nom2$Gender <-
recode_factor(
newGMP17_nom2$Gender,
"1" = "Male",
"2" = "Female"
)
# Recoding the ObjectSearch levels based on the data dictionary
newGMP17_nom2$ObjectSearch <-
recode_factor(
newGMP17_nom2$ObjectSearch,
"1" = "Controlled_Drugs",
"2" = "Harmful_Objects"
)
```
```{r}
# Creating the contingency table for subset data
cGMP17 = table(newGMP17_nom2)
print(cGMP17)
```
### Visualizing data using mosaic plot
- We can use a mosaic plot to represent the data.
```{r}
#| echo: true
mosaicplot(cGMP17,
main ='Mosaic Plot',
color = TRUE)
```
### Running the Fisher's exact test using fisher.test()
**What if we just run a Chi-square test?**

Using our GMP17 dataset, we can try to run a Chi-square test instead of Fisher's exact test and see what happens.

R warns that the Chi-square approximation may be incorrect, and inspecting the expected cell counts helps show why. We should therefore use another test, in this case Fisher's exact test.
```{r}
#| echo: true
chisq.test(cGMP17)$expected
```
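
The rule of thumb from the introduction can be checked programmatically. The following sketch uses a hypothetical 2x2 table (`toy_table`, with invented counts) and computes the share of cells whose expected count is below 5; the 20% threshold is the one cited from Bower (2003).

```{r}
# Hypothetical 2x2 table (invented counts, not the GMP17 data)
toy_table <- matrix(
  c(8, 1, 3, 4),
  nrow = 2,
  dimnames = list(Group = c("G1", "G2"), Outcome = c("Yes", "No"))
)

# Expected counts under independence; the warning is suppressed here
# because we inspect the expected counts directly
expected_counts <- suppressWarnings(chisq.test(toy_table)$expected)
print(expected_counts)

# Share of cells with an expected count below 5
share_small <- mean(expected_counts < 5)

# Rule of thumb: prefer Fisher's exact test when more than 20% of cells are small
use_fisher <- share_small > 0.20
print(c(share_small = share_small, use_fisher = use_fisher))
```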
### Running the test
```{r}
# running the fisher's exact test
test <- fisher.test(cGMP17)
test
```
Using the gtsummary package to view the results.
```{r}
newGMP17_nom2 |>
tbl_summary(by = Gender) |>
add_p() |>
add_overall()
```
### Interpretation of results
The key result is the p-value, which we can retrieve with the following code:
```{r}
#| echo: true
test$p.value
```
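\[Odds ratio = 6.33, 95% CI = 0.85-73.59\] We reject the null hypothesis (*p* \< 0.05) and conclude that there is a statistically significant association between the two categorical variables (gender and object search events). Note that the 95% confidence interval for the odds ratio is wide and includes 1, so the size of the association is estimated imprecisely.

The odds ratio compares the odds of one object search category versus the other between the two genders. Assuming the table is ordered with Male and Controlled_Drugs first, the odds that a search was for controlled drugs rather than harmful objects are estimated to be 6.33 times as high for males as for females. Because every record in the data is a stop and search event, this describes what males and females were searched for rather than how likely each gender is to be stopped.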
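
The object returned by `fisher.test()` holds more than the p-value. As a sketch on a hypothetical 2x2 table (`toy_table`, with invented counts), the chunk below pulls out the p-value, the odds ratio estimate, and its confidence interval. It also computes the simple cross-product odds ratio for comparison, since `fisher.test()` reports a conditional maximum likelihood estimate that generally differs from it.

```{r}
# Hypothetical 2x2 table (invented counts, not the GMP17 data)
toy_table <- matrix(
  c(8, 1, 3, 4),
  nrow = 2,
  dimnames = list(Group = c("G1", "G2"), Outcome = c("Yes", "No"))
)

toy_test <- fisher.test(toy_table)

# Components of the returned "htest" object
toy_test$p.value    # two-sided p-value
toy_test$estimate   # conditional maximum likelihood estimate of the odds ratio
toy_test$conf.int   # confidence interval for the odds ratio (95% by default)

# Simple cross-product (sample) odds ratio: (a * d) / (b * c)
sample_or <- (toy_table[1, 1] * toy_table[2, 2]) /
  (toy_table[1, 2] * toy_table[2, 1])
print(sample_or)
```

If the confidence interval contains 1, the data are compatible with no difference in odds, which is worth reporting alongside the p-value.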
### Visualizing statistical results with plots using ggstatsplot
- We use the ggstatsplot package to visualize the results in a plot.
```{r}
#| echo: true
# Fisher's exact test
test <- fisher.test(cGMP17)
# combine plot and statistical test with ggbarstats
ggbarstats(
newGMP17_nom2, Gender, ObjectSearch,
results.subtitle = FALSE,
subtitle = paste0(
"Fisher's exact test", ", p-value = ",
ifelse(test$p.value < 0.001, "< 0.001", round(test$p.value, 3))
)
)
```
From the plot, the proportions of males and females differ between the object search categories, suggesting that there is a relationship between the two variables.

This is confirmed by the p-value displayed in the subtitle of the plot. As before, we reject the null hypothesis and conclude that the variables gender and stop and search events are not independent (p-value = 0.038).
## What if we have more than two levels?
Using the drug example from earlier, let's say we have three drugs (Drug A, Drug B, and Drug C) and we want to see whether there is any relationship with gender (Male/Female).
```{r}
#| echo: true
# Creating a data frame
df = data.frame (
"Drug" = c("Drug A", "Drug B", "Drug A", "Drug C", "Drug C"),
"Gender" = c("Male", "Male", "Female", "Female", "Female")
)
# Creating contingency table using table()
ctable = table(df)
print(ctable)
```
```{r}
# Running the Fisher's Exact test for the 3x2 table
fisher.test(ctable)
```
The p-value is non-significant \[*p* = 0.6\]; we fail to reject the null hypothesis (*p* \> 0.05) and conclude that there is no evidence of an association between the drug treatments and gender. If the results had been significant, we would have gone ahead and conducted pairwise comparisons.
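
A pairwise follow-up could look like the sketch below. It uses a hypothetical 3x2 table with larger invented counts (`drug_counts`), runs Fisher's exact test on every pair of drugs, and adjusts the p-values for multiple comparisons with `p.adjust()`. The Holm correction is an assumption chosen here for illustration; other corrections may suit a given analysis better.

```{r}
# Hypothetical 3x2 table (invented counts, for illustration only)
drug_counts <- matrix(
  c(12, 3,
    5, 9,
    2, 11),
  nrow = 3, byrow = TRUE,
  dimnames = list(Drug = c("Drug A", "Drug B", "Drug C"),
                  Gender = c("Female", "Male"))
)

# All pairs of drugs to compare
drug_pairs <- combn(rownames(drug_counts), 2, simplify = FALSE)

# Fisher's exact test on the 2x2 sub-table for each pair
raw_p <- sapply(drug_pairs, function(pair) {
  fisher.test(drug_counts[pair, ])$p.value
})
names(raw_p) <- vapply(drug_pairs, paste, character(1), collapse = " vs ")

# Adjust for multiple comparisons (Holm method, chosen as an example)
adj_p <- p.adjust(raw_p, method = "holm")
print(data.frame(raw_p = raw_p, adj_p = adj_p))
```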
## References
1. Bower, Keith M. 2003. "When to Use Fisher's Exact Test." In *American Society for Quality, Six Sigma Forum Magazine*, 2:35--37. 4.
2. McCrum-Gardner, Evie. 2008. "Which Is the Correct Statistical Test to Use?" *British Journal of Oral and Maxillofacial Surgery* 46 (1): 38--41.
3. Wong KC. [Chi squared test versus Fisher's exact test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5426219/). Hong Kong Med J. 2011 Oct;17(5):427
4. Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach. Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
5. Zach Bobbit. (2021). [Fisher's Exact Test: Definition, Formula, and Example](https://www.statology.org/fishers-exact-test/)
6. Bobbitt, Z. (2020). "Fisher's Exact Test: Definition, Formula, and Example." [statology.org](https://www.statology.org/fishers-exact-test/)