GPSeq in a nutshell
Let's focus on a GPSeq experiment comprising multiple conditions, one of which is a negative condition with no restricted cutsites and, thus, no reads. Each condition is characterized by its own number of cells and its own number of reads.
Note that the maximum resolution of a GPSeq experiment is the single cutsite, as cutsites are where de-duplicated reads are located. In other words, the cutsite is GPSeq's unit of measure.
Depending on the restriction enzyme (e.g., a 4 bp or a 6 bp cutter), cutsites are more or less sparse, which yields a higher or lower theoretical maximum resolution. The real maximum resolution is always lower than the theoretical one, as some reads are lost during sequencing and some cutsites are never digested.
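To get a feeling for how cutter length affects cutsite sparsity, here is a minimal sketch. It assumes a random sequence with equal base frequencies, which real genomes only approximate; the function name `expected_cutsite_spacing` is hypothetical and not part of GPSeqC.

```python
def expected_cutsite_spacing(recognition_length: int) -> int:
    """Expected distance (in bp) between occurrences of one fixed
    recognition sequence, ASSUMING independent bases with equal
    frequencies (1/4 each). Real genomes deviate from this."""
    if recognition_length < 1:
        raise ValueError("recognition_length must be a positive integer")
    # Each position matches a given k-mer with probability (1/4) ** k,
    # so the expected spacing is 4 ** k.
    return 4 ** recognition_length


four_bp = expected_cutsite_spacing(4)  # 4 bp cutter
six_bp = expected_cutsite_spacing(6)   # 6 bp cutter
print(four_bp, six_bp)
```

Under this assumption a 4 bp cutter has cutsites 16 times denser than a 6 bp cutter, hence the higher theoretical resolution.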
Also, notice that each cutsite can have up to as many reads as there are cells in the condition since, after de-duplication, a read represents a digestion event occurring in one cell.
Let's define a genomic region $r$ located on chromosome $r_{chr}$ between the genomic coordinates $r_{start}$ (first) and $r_{end}$ (last). Taking into account that GPSeq captures restriction events, and that we focus only on restricted cutsites, we can define the probability of digesting $r$ in condition $c$ as:
$$P_{r,c} = \frac{n_{r,c}}{n_c \cdot k_{r,c}}$$
where $n_{r,c}$ is the number of de-duplicated reads mapping to $r$ in condition $c$, $n_c$ is the total number of de-duplicated reads in condition $c$, and $k_{r,c}$ is the number of cutsites in $r$ in condition $c$ considered in the analysis.
In other words, we normalize the number of restriction events in a genomic region by the number of restriction events in the condition and by the number of considered cutsites in the region itself; this makes $P_{r,c}$ comparable across different regions and different conditions.
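The normalization described above can be sketched as follows. This is only an illustration under stated assumptions: the function and parameter names are hypothetical, and the numbers in the usage example are made up.

```python
def digestion_probability(region_reads: int,
                          condition_reads: int,
                          region_cutsites: int) -> float:
    """Reads in the region, normalized by the total reads in the
    condition and by the number of cutsites considered in the region.
    All names here are hypothetical, chosen for illustration."""
    if condition_reads <= 0:
        # E.g., the negative condition, which has no reads by definition.
        raise ValueError("condition has no reads; probability is undefined")
    if region_cutsites <= 0:
        raise ValueError("region has no considered cutsites")
    return region_reads / (condition_reads * region_cutsites)


# Made-up numbers: 50 reads in the region, 1,000,000 reads in the
# condition, 10 considered cutsites in the region.
p = digestion_probability(50, 1_000_000, 10)
print(p)
```

Doubling the number of cutsites in the region while keeping the read count fixed halves the value, which is what makes regions of different cutsite density comparable.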
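The number of reads in the region $r$ in condition $c$ is:
$$n_{r,c} = \sum_{i} n_{s_i,c}$$
where $s_i$ is the $i$-th cutsite in $r$ and $n_{s_i,c}$ is the number of de-duplicated reads mapping to it in condition $c$. Remember that a cutsite is, essentially, a small region; then:
$$s \in r \iff s_{chr} = r_{chr} \;\land\; r_{start} \le s_{start} \le r_{end}$$
In other words, we consider a cutsite $s$ as belonging to region $r$ when region and site are on the same chromosome and the site start position is included in the region.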
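The membership rule and the per-region read count can be sketched as below. The `Region` and `Cutsite` classes are hypothetical containers, not GPSeqC data structures, and treating both region boundaries as inclusive is an assumption based on the "first" and "last" wording.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # first coordinate (assumed inclusive)
    end: int    # last coordinate (assumed inclusive)


@dataclass(frozen=True)
class Cutsite:
    chrom: str
    start: int
    reads: int  # de-duplicated reads at this cutsite in one condition


def site_in_region(site: Cutsite, region: Region) -> bool:
    """Same chromosome, and site start within the region boundaries."""
    return site.chrom == region.chrom and region.start <= site.start <= region.end


def region_read_count(sites: Iterable[Cutsite], region: Region) -> int:
    """Sum the reads of all cutsites that belong to the region."""
    return sum(s.reads for s in sites if site_in_region(s, region))


sites = [
    Cutsite("chr1", 100, 3),
    Cutsite("chr1", 500, 0),   # empty cutsite: belongs, contributes nothing
    Cutsite("chr1", 1200, 7),  # outside the region
    Cutsite("chr2", 300, 5),   # different chromosome
]
region = Region("chr1", 100, 1000)
total = region_read_count(sites, region)
print(total)
```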
Additionally, we define a non-empty cutsite as a cutsite with at least one mapped de-duplicated read, corresponding to a cutsite being restricted in a cell and being sampled during the sequencing run.
Let's consider a genomic region comprising several units (single or grouped cutsites, see 2.d); we can define a mean and a variance of the restriction event count across the units of the window.
Please note that the variance is a sample variance.
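A minimal sketch of these per-window statistics, using only the Python standard library. The function name `unit_stats` is hypothetical, and the n - 1 normalization is an assumption (the usual meaning of "sample variance"); the counts are made up.

```python
import statistics
from typing import Sequence, Tuple


def unit_stats(counts: Sequence[int]) -> Tuple[float, float]:
    """Mean and sample variance of restriction event counts, one count
    per unit (single or grouped cutsite) in the window.
    ASSUMPTION: variance is normalized over n - 1."""
    if len(counts) < 2:
        raise ValueError("at least two units are needed for a sample variance")
    mean = statistics.mean(counts)
    # statistics.variance uses the n - 1 denominator.
    variance = statistics.variance(counts)
    return mean, variance


# Made-up counts for a window of four units.
mean, variance = unit_stats([2, 4, 4, 6])
print(mean, variance)
```

If the population normalization (over n) were intended instead, `statistics.pvariance` would be the corresponding call.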
GPSeqC v2.3.3 is published under the MIT License - Copyright (c) 2017-18 Gabriele Girelli