Fishersexactest.qmd

---
title: "Fisher's Exact Test"
subtitle: "Explaining Fisher's Exact Test, how to run it in R and it's interpretation."
title-block-banner: true
author:
  - name: Tendai Gwanzura
  - name: Ana Bravo
format:
  html:
    code-fold: true
    html-math-method: katex
toc: true
editor: visual
theme: zephyr
bibliography: export-data.bib
---

## Introduction

-   Fisher's exact test is an independent test used to determine if there is a relationship between categorical (non-parametric) variables with a small sample size.

-   Used to assess whether proportions of one variable are different among values of another table.

-   Uses (hypergeometric) marginal distribution to derive exact p-values which are not approximated which are somewhat conservative.

-   The rules of Chi distribution do not apply when the frequency count is \<5 for more than 20% of the cells in a contingency table (Bower 2003).

-   Data is easily manipulated by using a contingency table.


### Assumptions

1.  Assumes that the individual observations are independent.
2.  Assumes that the row and column totals are fixed or conditioned.
3.  The variables are categorical and randomly sampled.
4.  Observations are count data.

### Hypotheses

The hypotheses of Fisher's exact test are similar to Chi-square test:

Null hypothesis:$(H_0)$ There is no relationship between the categorical variables, the variables are independent.

Alternative hypothesis: $(H_1)$ There is a relationship between the categorical variables, the variables are dependent.

## Fisher's Exact Test Equation

Fisher's exact test for a one-tailed p-value is calculated using the following formula: 

$$   p = {(a+b)!(c+d)!(a+c)!(b+d)! \over a! b! c! d! n!} $$
 - n = population size/ total frequency
 - a + b = "successes" values in the contingency table
 - a + c = sample size / draws from the population
 - a = sample successes
 
### Formula description

this test is usually used as a one-tailed test but it can also be used as a two tailed test as well, $a$,$b$,$c$, and $d$ are the individual frequencies on the 2x2 contingency table and $n$ is our total frequency. This particular test is used to obtain the probability of the combination of frequencies that we can actually obtain. 
 

### What is a contingency table?

This is a table that shows the distribution of a variable in the rows and columns. Sometimes referred to as a 2x2 table. They are useful in summarizing categorical variables. The table() function is used to create a contingency table in R. When the variables of interest are summarized in a contingency table it is easier to run the Fisher's Exact test.

#### Example: Creating a contingency table

Lets say we have information on the gender of participants in a clinical trial and the type of drug administered to them we can create the following contingency table for further analysis.

```{r, contingency-table}

# Example R code to create a contingency table

# Creating a data frame
 df = data.frame (
   "Drug" = c("Drug A", "Drug B", "Drug A"),
   "Gender" = c("Male", "Male", "Female")
 )
 
# Creating contingency table using table()
 ctable = table(df)
 print(ctable)

```

### Performing Fisher's Exact Test in R

We will need to install the ggstatplot package to visualize the statistical results.

```{r, install }
#| echo: true

#install.packages("ggstatplot") 
#install.packages("summarytools")
#install.packages("gmodels")
#install.packages("gt")
#install.packages("tidyverse")

```

### Data Source: GMP2017

For this example we will be using the Greater Manchester Police's UK stop and search data from 2017(December) sourced from the Sage Research Methods Dataset Part 2. This data has information on stop and search events, gender and ethnicity. For this example we would like to access whether there is a significant relationship between gender and stop and search events?

```{r, load-data}
#| echo: true
GMP17 <- read.csv("dataset-gmss-2017-subset1.csv")
```

### Load in libraries

```{r, load-libraries}
#| echo: true
#| message: false


library(gmodels)
library(ggstatsplot)
library(gt)
library(gtsummary)
library(katex)
library(tidyverse)
```

### Descriptive summary

```{r, data-summary}
#| code-fold: false

head(GMP17)
str(GMP17)
# determining the number of rows
NROW(GMP17)

```

### Assessing frequencies to answer research question

For this analysis we will use the Gender variable and the ObjectSearch variable

```{r}
#| echo: true

# Dropping the Ethnicity variable to remain with variables of interest for for the 2x2 table

newGMP17 <-GMP17[ -c(2) ]
 
head(newGMP17)

  
```

The data contains missing values categorized as -9 that we need to drop and we need to rename our variables based on the data dictionary provided <https://methods.sagepub.com/dataset/fishers-exact-gmss-2017-r>.

```{r}
# Exclude rows that have missing data in both variables
newGMP17_nom <- subset(newGMP17, Gender > 0)
newGMP17_nom2 <- subset(newGMP17_nom, ObjectSearch  > 0)
summary(newGMP17_nom2)
nrow(newGMP17_nom2)

```

```{r}
# Renaming the Gender variable based on data dictionary
newGMP17_nom2$Gender <- 
  recode_factor(
    newGMP17_nom2$Gender,
            "1" = "Male",
            "2" = "Female"
)

# Renaming the Gender variable based on data dictionary
newGMP17_nom2$ObjectSearch <- 
  recode_factor(
    newGMP17_nom2$ObjectSearch,
            "1" = "Controlled_Drugs",
            "2" = "Harmful_Objects"
)

```

```{r}

# Creating the contingency table for subset data
cGMP17 = table(newGMP17_nom2)
print(cGMP17)
```

### Visualizing data using mosaic plot

-   we can use the mosaic plot to represent the data.

```{r}
#| echo: true
mosaicplot(cGMP17,
           main ='Mosaic Plot',
           color = TRUE)

```

### Running the Fisher's exact test using fisher.test()

**What if we just run a Chi-square test?**

Using our GMP17 dataset we can try to run a Chi-square test instead of the Fisher's Exact test and see what happens.

The R output gives us a warning that the Chi Square is not appropriate hence we should use another test in this case the Fisher's Exact Test.

```{r}
#| echo: true
chisq.test(cGMP17)$expected
```

### Running the test

```{r}
# running the fisher's exact test

test <- fisher.test(cGMP17)
test


```

Using the gt summary to view results.

```{r}


newGMP17_nom2 |> 
  tbl_summary(by = Gender) |> 
   add_p() |> 
  add_overall()


```

### Interpretation of results

The most important test statistic is the p - value therefore we can retrieve the specific result using the following code;

```{r}
#| echo: true

test$p.value 

```

Odds ratio = 6.33, 95% CI = 0.85-73.59\], we reject the null hypothesis (*p* \< 0.05) and conclude that there is a strong association between the two categorical independent variables (gender and object search events)

Therefore the odds ratio indicates that the odds of getting stopped and searched by gender is 6.33 times as likely for males compared to females. In other words, males are more likely of getting stopped and searched than females.

### Visualizing statistical results with plots using ggstatsplot

-   we download the ggsattsplot package to visualize the results in a plot.

```{r}
#| echo: true

# Fisher's exact test 

test <- fisher.test(cGMP17)

# combine plot and statistical test with ggbarstats

ggbarstats(
 newGMP17_nom2, Gender, ObjectSearch,
 results.subtitle = FALSE,
 subtitle = paste0(
 "Fisher's exact test", ", p-value = ",
 ifelse(test$p.value < 0.001, "< 0.001", round(test$p.value, 3))
  )
 )

```

From the plot, it is clear that the proportion of males among object search events is higher compared to females, suggesting that there is a relationship between the two variables.

This is confirmed thanks to the p-value displayed in the subtitle of the plot. As previously, we reject the null hypothesis and we conclude that the variables gender and stop and search events are not dependent (p-value = 0.038).

## What if we have more than two levels?

Using the drug example used previously lets say we have 3 drugs 'Drug A, Drug B or Drug C' and we want to see if there is any relationship with gender 'Male/Female'.

```{r}
#| echo: true

# Creating a data frame
 df = data.frame (
   "Drug" = c("Drug A", "Drug B", "Drug A", "Drug C", "Drug C"),
   "Gender" = c("Male", "Male", "Female", "Female", "Female")
 )
 
# Creating contingency table using table()
 ctable = table(df)
 print(ctable)

```

```{r}

# Running the Fisher's Exact test for the 3x2 table
fisher.test(ctable)
```

The p-value is non-significant \[*p* = 0.6\], we fail to reject the null hypothesis (*p* \< 0.05) and conclude that there is no association between the drug treatments and gender. If the results had been significant we would have gone ahead and conducted a pair wise comparison. 

## References

1.  Bower, Keith M. 2003. "When to Use Fisher's Exact Test." In *American Society for Quality, Six Sigma Forum Magazine*, 2:35--37. 4.
2.  McCrum-Gardner, Evie. 2008. "Which Is the Correct Statistical Test to Use?" *British Journal of Oral and Maxillofacial Surgery* 46 (1): 38--41.
3.  Wong KC. [Chi squared test versus Fisher's exact test](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5426219/). Hong Kong Med J. 2011 Oct;17(5):427
4.  Patil, I. (2021). Visualizations with statistical details: The 'ggstatsplot' approach. Journal of Open Source Software, 6(61), 3167, doi:10.21105/joss.03167
5. Zach Bobbit. (2021). [Fisher's Exact Test: Definition, Formula, and Example](https://www.statology.org/fishers-exact-test/)
6. Bobbitt, Z. (2020). "Fisher's Exact Test: Definition, Formula, and Example." [statology.org](https://www.statology.org/fishers-exact-test/)