4. PREPROCESSING

Introduction

Hecatomb is designed to perform rigorous quality control prior to assembly and taxonomic assignment. This rigor is justified primarily by the principle of Garbage In, Garbage Out (GIGO). More specifically, the following issues are addressed to ensure that only non-contaminant biological sequence is retained for downstream analysis.

1. Non-biological sequence removal (primers, adapters)
2. Host sequence removal
3. Removal of redundant sequences (clustering)
	- Creation of sequence count table
	- Calculation of sequence properties (e.g. GC content, tetramer frequencies)
4. Assembly
	- Sample assembly
	- Population assembly
	- Contig abundance estimation
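
To make step 3 more concrete, the following is a minimal Python sketch of a sequence count table and the two example sequence properties (GC content and tetramer frequencies). It is an illustration only, not how Hecatomb implements these steps. The function names `gc_content`, `tetramer_frequencies`, and `count_table` are hypothetical, and collapsing exact duplicates is assumed here as a simplified stand-in for real sequence clustering.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous (A/C/G/T) bases in seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetramer_frequencies(seq: str) -> dict:
    """Relative frequency of each overlapping 4-mer made only of A/C/G/T."""
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= valid  # skip 4-mers containing N, etc.
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def count_table(reads) -> Counter:
    """Assumption: exact-duplicate collapsing stands in for clustering.

    Returns a mapping of each unique sequence to the number of reads
    that carried it.
    """
    return Counter(read.upper() for read in reads)


if __name__ == "__main__":
    reads = ["ACGTACGT", "acgtacgt", "TTTTGGGG"]
    for seq, n in count_table(reads).items():
        print(seq, n, round(gc_content(seq), 2), len(tetramer_frequencies(seq)))
```

Each unique sequence is printed with its read count, its GC fraction, and the number of distinct tetramers observed in it. A real pipeline would read FASTQ input and cluster at a similarity threshold rather than on exact identity.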

The preprocessing rule also performs assembly, as contigs are an important prerequisite for many downstream analyses.

Non-biological sequences

During the library production

Additional Reading:

- [Official Snakemake documentation](https://snakemake.readthedocs.io/en/stable/)
- [bbtools](https://jgi.doe.gov/data-and-tools/bbtools/)
- [seqkit](https://bioinf.shenwei.me/seqkit/)
- [minimap2](https://github.com/lh3/minimap2)
- [mmseqs2 GitHub](https://github.com/soedinglab/MMseqs2)
- [Megahit GitHub](https://github.com/voutcn/megahit)
- [Flye GitHub](https://github.com/fenderglass/Flye)
