What follows is a technical documentation of each step in the pipeline.
The steps get_cellranger
,
get_gtf
,
get_fa
,
get_9_to_10
,
get_10_to_39
,
unpack_bcl2fastq
,
and clean_names
are not here covered,
as the simply move or rename files.
Runs FastQC on all input FastQ files.
5 threads are used to increase the memory available per sample.
Aggregates various QC metrics using MultiQC.
For this pipeline,
metrics are collected from Rule: fastqc,
Rule: unpack_bcl2fastq
,
and Rule: star.
Converts the given PolyA_DB to a BED file.
Though UCSC LiftOver tool can take several types of inputs, it proves easiest to construct and use a BED file, as it is their default. BED files are 0-indexed, not 1-indexed like PolyA_DB and GTF, so we must subtract 1 from the passed values.
Converts the given PolyA_DB coordinates to mm39
.
PolyA_DB provides coordinates from the mm9
reference genome,
which was assembled in 2010.
To use this information on modern builds,
we must first convert to mm10
-
as LiftOver cannot convert directly from mm9
to modern builds -
before converting to the most recent mm39
.
To do this,
liftover is simply run twice:
- Converts from
mm9
tomm10
- Converts from
mm10
tomm39
, using the above output.
Runs cellranger mkgtf
to filter the GTF.
This filters the provided GTF file to a reduced list of relevant gene and transcript biotypes. The list used is described in CellRanger's instructions for making a custom reference. This is a potential point of optimisation. Currently, all features are hardcoded. One possibility would be to pass them as a list from the config file, allowing the user to specify their own features, should they desire.
Alter the downloaded GTF reference file.
Use the venerable awk to modify the existing GTF. This is not done inplace, so the original remains for reference.
The necessary awk
command can actually be passed as a one-liner,
making Snakemake's shell directive theoretically possible.
However,
the command is pretty gnarly,
and nearly impossible to read,
in such a format.
Instead a shell script is used,
with Snakemake variables passed and parsed using getopts.
Iteration occurs line-by-line over the input BED file,
splitting the fields into variables.
The current feature endpoint is extracted from field 5 of the GTF
in the line that contains the gene name and the feature type gene
.
The new endpoint is incremented by 1 to account for the BED -> GTF conversion.
The old endpoint is replaced with the new enpoint in any line where:
- The old endpoint is present and...
- The gene name is present
This has the effect of altering the:
- Gene
- Transcript
- Final Exon
- 3' UTR
for the given gene.
Uses cellranger mkref
to create the final reference.
Relatively straightforward: uses the filtered, updated GTF, and the provided FA to produce the necessary reference genome.
Uses cellranter count
to create the count matrix.
Interestingly, CellRanger does not seem to let you create the results in a sub-directory, but Snakemake seems to create a non-empty directory at the specified results paths. So to get the cellranger results in the correct location, after the run, we delete the Snakemae created directory and move the CellRanger created directory to its location.
Other notes on the passed parameters:
- Secondary analysis is disabled.
- Whether or not introns are included is set by the config file.
- This iterates over samples by lane. This means that the lane can be tracked across sample, even when pooling.