This repository has been archived by the owner on Oct 15, 2020. It is now read-only.

GPSeq in a nutshell

Gabriele Girelli edited this page Apr 24, 2018 · 7 revisions

Data description

Let's focus on a GPSeq experiment E comprising n+1 conditions D:

$$E = \{D_0, D_1, \ldots, D_n\}$$

where D_0 indicates a negative condition with no restricted cutsites and, thus, no reads. Each condition D_i is characterized by a different number of cells N_C(D_i) and reads N_R(D_i), with N_C(D_0)>0 and N_R(D_0)=0.

It is important to note that the maximum resolution of a GPSeq experiment is the single cutsite, as the cutsite is where de-duplicated reads are located. In other words, the cutsite is GPSeq's unit of measure.

Depending on the restriction enzyme (e.g., a 4 bp or 6 bp cutter), cutsites are denser or sparser, which yields a higher or lower theoretical maximum resolution (TMR), respectively. The real maximum resolution (RMR) is always lower than the TMR (RMR<TMR), as some reads are lost during sequencing and some cutsites are never digested.
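As a rough illustration of why the cutter length affects the TMR, the sketch below estimates the expected spacing between cutsites. It assumes a genome of independent, equiprobable bases, which is an idealization and not part of GPSeq itself; real genomes deviate from it. The function name `expected_cutsite_spacing` is hypothetical.

```python
def expected_cutsite_spacing(recognition_length: int) -> int:
    """Expected distance (in bp) between consecutive cutsites.

    Assumption: bases are independent and equiprobable, so a specific
    recognition sequence of length L occurs on average once every 4**L bp.
    """
    if recognition_length < 1:
        raise ValueError("recognition_length must be a positive integer")
    return 4 ** recognition_length


for length in (4, 6):
    spacing = expected_cutsite_spacing(length)
    print(f"{length} bp cutter: about one cutsite every {spacing} bp")
```

Under this assumption a 4 bp cutter yields roughly one cutsite every 256 bp and a 6 bp cutter one every 4096 bp, so the shorter recognition sequence gives the denser cutsite map.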

Also, notice that each cutsite can have up to N_C(D_i) reads as, after de-duplication, a read represents a digestion event occurring in one cell.

Probability of restriction

Let's define a genomic region w = (c, f, l) located on chromosome c between the genomic coordinates f (first) and l (last). Taking into account that GPSeq captures restriction events, and that we focus only on restricted cutsites, we can define the probability of digesting w as:

$$P(w, D_i) = \frac{N_R(w, D_i)}{N_R(D_i) \cdot N_s(w, D_i)}$$

where N_R(w,D_i) is the number of de-duplicated reads mapping to w in condition D_i, and N_s(w,D_i) is the number of cutsites in w in condition D_i considered in the analysis.

In other words, we normalize the number of restriction events in a genomic region by the number of restriction events in the condition and the number of considered cutsites in the region itself; this makes P(w,D_i) comparable across different regions and different conditions.
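The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the GPSeq implementation: the function and parameter names are hypothetical, and the inputs are assumed to be pre-computed counts.

```python
def restriction_probability(
    reads_in_region: int,      # N_R(w, D_i): de-duplicated reads mapping to w
    reads_in_condition: int,   # N_R(D_i): de-duplicated reads in the condition
    cutsites_in_region: int,   # N_s(w, D_i): cutsites in w considered in the analysis
) -> float:
    """Reads in the region, normalized by condition size and cutsite count."""
    if reads_in_condition <= 0:
        # The negative condition D_0 has no reads, so P is undefined there.
        raise ValueError("the condition must contain at least one read")
    if cutsites_in_region <= 0:
        raise ValueError("the region must contain at least one considered cutsite")
    return reads_in_region / (reads_in_condition * cutsites_in_region)


# Illustrative numbers only: 50 reads in a region with 10 considered cutsites,
# in a condition with one million de-duplicated reads.
p = restriction_probability(50, 1_000_000, 10)
print(p)
```

Dividing by both the condition size and the number of cutsites is what allows values from regions of different size, and from conditions of different sequencing depth, to be compared directly.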

The number of reads in the region w in condition D_i is:

$$N_R(w, D_i) = \sum_{s_j \in w} N_R(s_j, D_i)$$

where s_j is the j-th cutsite in w. Remember that a cutsite s = (c_s, f_s, l_s) is, essentially, a small region; then:

$$s \in w \iff c_s = c \land f \leq f_s \leq l$$

In other words, we consider a cutsite s as belonging to region w when region and site are on the same chromosome and the site start position is included in the region.
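The membership rule and the per-region read count can be sketched as follows. The `Region` type and the function names are hypothetical, and treating both region boundaries as inclusive is an assumption made for this example.

```python
from typing import Iterable, NamedTuple, Tuple


class Region(NamedTuple):
    """A genomic interval; used here for both regions and cutsites."""
    chrom: str
    first: int
    last: int


def belongs_to(site: Region, window: Region) -> bool:
    # Same chromosome, and the site start lies inside the window
    # (inclusive boundaries are an assumption of this sketch).
    return site.chrom == window.chrom and window.first <= site.first <= window.last


def reads_in_region(window: Region, site_reads: Iterable[Tuple[Region, int]]) -> int:
    # Sum the de-duplicated read counts of every cutsite belonging to the window.
    return sum(count for site, count in site_reads if belongs_to(site, window))


window = Region("chr1", 1000, 2000)
site_reads = [
    (Region("chr1", 1000, 1003), 3),  # start on the window boundary: included
    (Region("chr1", 1998, 2001), 2),  # starts inside, ends outside: included
    (Region("chr1", 2500, 2503), 7),  # starts after the window: excluded
    (Region("chr2", 1500, 1503), 4),  # different chromosome: excluded
]
print(reads_in_region(window, site_reads))  # prints 5
```

Note that only the start position of the cutsite is tested, so a cutsite that overhangs the end of the region is still counted in that region.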

Additionally, we define a non-empty cutsite as a cutsite with at least one mapped de-duplicated read, corresponding to a cutsite being restricted in a cell and being sampled during the sequencing run.

$$N_R(s, D_i) \geq 1$$

Restriction event count: average and variance

Let's consider a genomic region w comprising k units (single or grouped cutsites, see 2.d). We can define the mean (E) and variance (V) of the restriction event count across the units of the window.

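$$E(w, D_i) = \frac{1}{k} \sum_{j=1}^{k} N_R(u_j, D_i)$$

$$V(w, D_i) = \frac{1}{k-1} \sum_{j=1}^{k} \left( N_R(u_j, D_i) - E(w, D_i) \right)^2$$

where u_j is the j-th unit of w.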

Please note that V is a sample variance (normalized by k-1).
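As a minimal sketch, the mean and the sample variance of the per-unit restriction event counts can be computed with Python's standard library, whose `statistics.variance` also normalizes by k-1. The function name `unit_count_stats` and the example counts are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence, Tuple


def unit_count_stats(unit_counts: Sequence[int]) -> Tuple[float, float]:
    """Return (E, V) for the restriction event counts of the k units of a window."""
    if len(unit_counts) < 2:
        # A sample variance normalized by k-1 needs at least two units.
        raise ValueError("at least two units are required")
    return mean(unit_counts), variance(unit_counts)


# Illustrative counts for a window of k = 4 units.
e, v = unit_count_stats([2, 4, 4, 6])
print(e, v)
```

For these counts the mean is 4 and the squared deviations sum to 8, so the sample variance is 8/3, whereas normalizing by k would give 2.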