Skip to content

Commit

Permalink
Merge pull request #33 from FredHutch/filter_qc_ki
Browse files Browse the repository at this point in the history
update QC report with two graphs
  • Loading branch information
cansavvy authored Jun 25, 2024
2 parents 15affcd + 97241a7 commit fa3cf08
Show file tree
Hide file tree
Showing 11 changed files with 222 additions and 12 deletions.
11 changes: 9 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,16 @@ Version: 0.1.0
Authors@R: c(
person("Candace", "Savonen", ,
c("cansav09@gmail.com","csavonen@fredhutch.org"),
role = c("aut", "cre")),
role = c("aut", "cre")
),
person("Howard", "Baek", ,
c("howardbaek@fredhutch.org","howardbaek.fh@gmail.com"),
role = c("aut"))
role = c("aut")
),
person("Kate", "Isaac", ,
c("kisaac@fredhutch.org"),
role = c("aut")
)
)
Description: This package allows a pipeline to be built for analyzing genetic interactions of paired guide CRISPR screens. It is based off of original research from the Alice Berger lab at Fred Hutchinson Cancer Center
License: GPL-3
Expand All @@ -25,6 +31,7 @@ Imports:
magrittr,
kableExtra,
pheatmap,
purrr,
Suggests:
testthat (>= 3.0.0),
roxygen2,
Expand Down
4 changes: 4 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,12 @@ export(run_qc)
export(setup_data)
import(ggplot2)
import(kableExtra)
importFrom(dplyr,mutate)
importFrom(ggplot2,ggplot)
importFrom(ggplot2,labs)
importFrom(magrittr,"%>%")
importFrom(pheatmap,pheatmap)
importFrom(purrr,map)
importFrom(purrr,reduce)
importFrom(tidyr,pivot_longer)
importFrom(tidyr,unite)
29 changes: 28 additions & 1 deletion R/02-filter.R
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
#' @description This is a function here's where we describe what it does
#' @param .data Data can be piped in with %>% or |> from function to function. But the data must still be a gimap_dataset
#' @param gimap_dataset A special dataset structure that is setup using the `setup_data()` function.
#' @param filter_type Can be one of the following: `zero_count_only`, `low_plasmid_cpm_only` or `rep_variation`, `zero_in_last_time_point` or a vector that includes multiple of these filters.
#' @param filter_type Can be one of the following: `zero_count_only`, `low_plasmid_cpm_only` or `rep_variation`, `zero_in_last_time_point` or a vector that includes multiple of these filters.
#' You should decide on the appropriate filter based on the results of your QC report.
#' @returns a filtered version of the gimap_dataset returned in the $filtered_data section
#' @export
Expand All @@ -20,6 +20,8 @@
#' gimap_dataset$filtered_data
#'
#' }
#'

gimap_filter <- function(.data = NULL,
gimap_dataset,
filter_type = "both") {
Expand All @@ -32,3 +34,28 @@ gimap_filter <- function(.data = NULL,

return(gimap_dataset)
}

# Possible filters

#' Create a filter for pgRNAs which have a raw count of 0 for any sample/time point
#' @description This function flags and reports which and how many pgRNAs have a raw count of 0 for any sample/time point
#' @param gimap_dataset The special gimap_dataset from the `setup_data` function which contains the transformed data
#' @importFrom magrittr %>%
#' @importFrom dplyr mutate
#' @importFrom purrr reduce map
#' @return a named list with the filter `filter` specifying which pgRNA have a count zero for at least one sample/time point and a report df `reportdf` for the number and percent of pgRNA which have a count zero for at least one sample/time point
#' @examples \dontrun{
#'
#' }
#'

qc_filter_zerocounts <- function(gimap_dataset){

counts_filter <- data.frame(gimap_dataset$raw_counts) %>% map(~.x %in% c(0)) %>% reduce(`|`)

zerocount_df <- data.frame("RawCount0" = c(FALSE, TRUE), n = c(sum(!counts_filter), sum(counts_filter))) %>%
mutate(percent = round(((n/sum(n))*100),2))

return(list(filter = counts_filter, reportdf = zerocount_df))

}
81 changes: 77 additions & 4 deletions R/plots-qc.R
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
#' @param gimap_dataset The special gimap_dataset from the `setup_data` function which contains the transformed data
#' @param qc_obj The object that has the qc stuff stored
#' @param wide_ar aspect ratio, default is 0.75
#' @importFrom tidyr pivot_longer
#' @importFrom tidyr pivot_longer unite
#' @importFrom ggplot2 ggplot labs
#' @return counts_cdf a ggplot
#' @examples \dontrun{
Expand Down Expand Up @@ -64,6 +64,80 @@ qc_sample_hist <- function(gimap_dataset, wide_ar = 0.75) {
return(sample_cpm_histogram)
}

#' Create a histogram for the variance within replicates for each pgRNA
#' @description This function uses pivot_longer to rearrange the data for plotting, finds the variance for each pgRNA construct (using row number as a proxy) and then plots a histogram of these variances
#' @param gimap_dataset The special gimap_dataset from the `setup_data` function which contains the transformed data
#' @param wide_ar aspect ratio, default is 0.75
#' @importFrom tidyr pivot_longer
#' @importFrom magrittr %>%
#' @import ggplot2
#' @return a ggplot histogram
#' @examples \dontrun{
#' gimap_dataset <- get_example_data("gimap")
#' qc_variance_hist(gimap_dataset)
#' }
#'

qc_variance_hist <- function(gimap_dataset, wide_ar = 0.75){

return(
gimap_dataset$transformed_data$log2_cpm[,3:5] %>%
as.data.frame() %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(-row) %>%
group_by(row) %>%
summarize(var = var(value)) %>%
ggplot(aes(x=var)) +
geom_histogram(binwidth = 0.1) +
theme(panel.background = element_blank(),
panel.grid = element_blank(),
aspect.ratio = wide_ar) +
xlab("variance") +
ylab("pgRNA construct count")
)

}

#' Create a bar graph that shows the number of replicates with a zero count for pgRNA constructs flagged by the zero count filter
#' @description A short description...
#' @param gimap_dataset The special gimap_dataset from the `setup_data` function which contains the transformed data
#' @param wide_ar aspect ratio, default is 0.75
#' @importFrom tidyr pivot_longer
#' @importFrom magrittr %>%
#' @import ggplot2
#' @return a ggplot barplot
#' @examples \dontrun{
#' gimap_dataset <- get_example_data("gimap")
#' qc_constructs_countzero_bar(gimap_dataset)
#' }
#'

qc_constructs_countzero_bar <- function(gimap_dataset, wide_ar = 0.75){

qc_filter_output <- qc_filter_zerocounts(gimap_dataset)

return(
example_counts[qc_filter_output$filter, c(3:5)] %>%
as.data.frame() %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(tidyr::unite(gimap_dataset$metadata$sample_metadata[c(3:5), c("day", "rep")], "colName")$colName,
values_to = "counts") %>%
group_by(row) %>%
summarize(numzero = sum(counts == 0),
max_diff = max(counts) - min(counts),
sec_diff = min(counts[counts > 0]) - min(counts)
) %>%
group_by(numzero) %>%
summarize(count = n()) %>%
ggplot(aes(x=numzero, y = count)) +
geom_bar(stat = "identity") +
theme_classic() +
ylab("Number of pgRNAs") +
xlab("Number of replicates with a zero") +
geom_text(aes(label = count, group = numzero), vjust = -0.5, size=2)
)
}

#' Create a correlation heatmap for the pgRNA CPMs
#' @description This function uses the `cor` function to find correlations between the sample CPM's and then plots a heatmap of these
#' @param gimap_dataset The special gimap_dataset from the `setup_data` function which contains the transformed data
Expand All @@ -75,7 +149,7 @@ qc_sample_hist <- function(gimap_dataset, wide_ar = 0.75) {
#'
#' }
#'
qc_cor_heatmap <- function(gimap_dataset, ...) {
qc_cor_heatmap <- function(gimap_dataset) {
cpm_cor <- gimap_dataset$transformed_data$cpm %>%
cor() %>%
round(2) %>%
Expand All @@ -87,8 +161,7 @@ qc_cor_heatmap <- function(gimap_dataset, ...) {
cellwidth = 20,
cellheight = 20,
treeheight_row = 20,
treeheight_col = 20,
...
treeheight_col = 20
)

return(sample_cor_heatmap)
Expand Down
15 changes: 15 additions & 0 deletions inst/rmd/gimapQCTemplate.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,21 @@ qc_sample_hist(gimap_dataset)
qc_cor_heatmap(gimap_dataset)
```

## Variance within replicates

```{r}
qc_variance_hist(gimap_dataset)
```

# Applying potential filters

## Replicates with a pgRNA count of 0

```{r}
qc_constructs_countzero_bar(gimap_dataset)
```

# Session Info

```{r}
sessionInfo()
Expand Down
3 changes: 2 additions & 1 deletion man/gimap_filter.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

26 changes: 26 additions & 0 deletions man/qc_constructs_countzero_bar.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/qc_cor_heatmap.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

23 changes: 23 additions & 0 deletions man/qc_filter_zerocounts.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

26 changes: 26 additions & 0 deletions man/qc_variance_hist.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

14 changes: 11 additions & 3 deletions vignettes/getting_started.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -133,12 +133,20 @@ str(gimap_dataset)

The first step is running some quality checks on our data. The `run_qc()` function will create a report we can look at to assess this.

The report includes several visualizations:
The report includes several visualizations of raw/unfiltered data:

- distribution of `count_norm` values for each sample. The goal of determining this distribution is to identify pgRNA counts that are low at the start of the screen - before any type of phenotypic or growth selection is occurring, either in the T0 or plasmid sample. These low abundance pgRNAs should be removed from the analysis.
- distribution of normalized counts for each sample. The goal of determining this distribution is to identify pgRNA counts that are low at the start of the screen - before any type of phenotypic or growth selection is occurring, either in the T0 or plasmid sample. These low abundance pgRNAs should be removed from the analysis. [the goal section here doesn't fit my expectations/understanding of this plot.]
- histogram of `log2cpm` values for each individual sample: this helps users identify samples that do not have a normal distribution of reads and inform the upcoming filtering steps.
- sample correlation heatmap: generates a heatmap using cpm values for each replicate/sample. This heatmap gives an overview on how similar samples are. Replicates should correlate well, and cluster together, while each timepoint sample should be different from the T0. This analysis will also allow users to identify potential sample swaps, if correlation scores between replicates is poor.
- log2cpm of plasmid counts. Any reads falling outside of the 1.5*IQR threshold will be filtered [TODO: Candace/Kate/Howard or is this being used as an argument in the code?]
- A histogram that shows the variance within replicates for each pgRNA construct. For each pgRNA construct, the variance among the 3 replicates is found and a distribution is constructed by looking at these variances together.

This report also includes several visualizations after filters are applied:

There is a filter that flags pgRNA constructs where any of the time points have a count of zero.
- We include a bar plot that shows the number of pgRNA constructs which have counts of zero in either 0, 1, 2, or 3 replicates.

There is a filter that flags pgRNA constructs that have low log2 CPM counts for the day 0 or plasmid time point.
- We include a histogram of the log2cpm of plasmid counts. Any reads falling outside of the 1.5*IQR threshold will be filtered [TODO: Candace/Kate/Howard or is this being used as an argument in the code?; note this graph isn't currently in the report]

```{r}
run_qc(gimap_dataset,
Expand Down

0 comments on commit fa3cf08

Please sign in to comment.