4. PREPROCESSING

shandley edited this page Mar 16, 2021 · 3 revisions

Introduction

Hecatomb is designed to perform rigorous quality control prior to assembly and taxonomic assignment. This rigor is justified primarily by the philosophy of Garbage In, Garbage Out (GIGO). More specifically, the following steps are performed to ensure that only non-contaminant biological sequence is retained for downstream analysis.

1. Non-biological sequence removal (primers, adapters)
2. Host sequence removal
3. Removal of redundant sequences (clustering)
	- Creation of sequence count table
	- Calculation of sequence properties (e.g. GC content, tetramer frequencies)
4. Assembly
	- Sample assembly
	- Population assembly
	- Contig abundance estimation

The preprocessing rule also performs assembly, because contigs are an important prerequisite for many downstream analyses.
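
To make the sequence properties named in step 3 concrete, here is a minimal Python sketch of how GC content and tetramer frequencies can be computed for a single sequence. It is an illustration only and is not taken from the Hecatomb code base, which may well compute these values with one of the tools listed under Additional Reading. The function names `gc_content` and `tetramer_frequencies`, and the choice to skip windows containing ambiguous bases, are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T).

    Hypothetical helper for illustration; ambiguous bases such as N are ignored.
    """
    counts = Counter(seq.upper())
    gc = counts["G"] + counts["C"]
    total = gc + counts["A"] + counts["T"]
    return gc / total if total else 0.0


def tetramer_frequencies(seq):
    """Return relative frequencies of all 256 tetramers in a sequence.

    Hypothetical helper for illustration. A sliding window of width 4 is
    moved along the sequence one base at a time; windows that contain an
    ambiguous base (anything other than A, C, G, T) are skipped.
    """
    seq = seq.upper()
    # Start every possible tetramer at zero so absent ones are still reported.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=4)}
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


if __name__ == "__main__":
    read = "ACGTACGTGGCC"
    print("GC content:", round(gc_content(read), 3))
    print("ACGT frequency:", round(tetramer_frequencies(read)["ACGT"], 3))
```

Reporting all 256 tetramers, including those with a count of zero, keeps the output the same length for every sequence, which is convenient when the values are later combined into a table.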

Non-biological sequences

During the library production

Additional Reading:

- [Official Snakemake documentation](https://snakemake.readthedocs.io/en/stable/)
- [bbtools](https://jgi.doe.gov/data-and-tools/bbtools/)
- [seqkit](https://bioinf.shenwei.me/seqkit/)
- [minimap2](https://github.com/lh3/minimap2)
- [mmseqs2 GitHub](https://github.com/soedinglab/MMseqs2)
- [Megahit GitHub](https://github.com/voutcn/megahit)
- [Flye GitHub](https://github.com/fenderglass/Flye)