Normalization and Trimming of amplicon sequencing data
NoTrAmp is a Tool for super fast trimming and read-depth normalization of amplicon reads. It is designed to be used in amplicon-tiling panels (or similar multiplexed amplicon sequencing approaches) to cap coverage of each amplicon and to trim amplicons to their appropriate length removing barcodes, adpaters and primers (if desired) in a single clipping step.
Amplicon-tiling schemes are employed to target and amplify specific sequences and enable coverage of longer regions of DNA with small, contiguous segments using overlapping amplicons. This approach is particularly useful for detection of mutations, characterization of genetic variation and allows generation of high quality assemblies from low input, fragmented DNA. It is frequently utilized for the sequencing of viral genomes and has seen extensive use during the SARS-CoV2 pandemic or during Ebola outbreaks [Citations, links to ARTIC], but is also very useful for exploration of specific genomic loci at high resolution in bacteria or eukaryotes.
Amplicon-tiling protocols include amplification of the target sequences in separate multiplex PCRs build on (typically) two complementary primer pools.
The performance of individual amplicons in these multiplex PCRs can be vastly different, resulting in large variations of read counts for different regions of the target sequence.
The necessity to accumulate enough reads at weak amplicons usually results in amassing orders of magnitude more reads than required at the more efficient amplicons.
This net overproduction increases the data load and can significantly slow down downstream processes. Additionaly, adapters and barcodes that are attached to DNA fragments during sequencing library preparation, as well as the PCR primers, which could otherwise conceal mutations/variations, need to be removed for downstream processing sequencing.
NoTrAmp addresses these issues by limiting the read depth at each amplicon to a set count and performs extremely fast one-step trimming, by removing primers, barcodes and adapters in the same clipping operation.
NoTrAmp is suitable for use with both long (e.g. ONT/PacBio) and short reads (e.g Illumina). However, when using reads that are significantly shorter than amplicon sizes, you should adjust the minimum required alignment length using the --set_min_len argument (see below).
install with pip:
pip install notramp
install with conda:
conda create -n notramp
conda activate notramp
conda install -c simakro notramp
or
conda create -n notramp -c simakro notramp
conda activate notramp
install notramp package and run:
notramp (-a | -c | -t) -p PRIMERS -r READS -g REFERENCE [optional arguments]
or download source from github and run from package dir:
notramp_main.py (-a | -c | -t) -p PRIMERS -r READS -g REFERENCE [optional arguments]
All arguments in detail:
usage:
notramp (-a | -c | -t) -p PRIMERS -r READS -g REFERENCE [optional arguments]
NoTrAmp is a Tool for read-depth normalization and trimming of reads in amplicon-tiling approaches. It trims amplicons to their appropriate length removing barcodes, adpaters and
primers (if desired) in a single clipping step and can be used to cap coverage of all amplicons at a chosen value.
Required arguments:
-p PRIMERS, --primers PRIMERS
Path to primer bed-file (primer-names must adhere to a consistent naming scheme see readme)
-r READS, --reads READS
Path to sequencing reads fasta
-g REFERENCE, --reference REFERENCE
Path to reference (genome) fasta file. Must contain only one target sequence. Multiple target sequences are not currently supported.
-a, --all Perform read depth normalization by coverage-capping/downsampling first, then clip the normalized reads. (mut.excl. with -c, -t)
-c, --cov Perform only read-depth normalization/downsampling. (mut.excl. with -a, -t)
-t, --trim Perform only trimming to amplicon length (excluding primers by default; to include primers set --incl_prim flag). (mut.excl. with -a, -c)
Optional arguments:
-h, --help Print help message and exit
-o OUT_DIR Optionally specify a directory for saving of outfiles. If this argument is not given, out-files will be saved in the directory where the input reads are located.
[default=False]
-m MAX_COV Provide threshold for maximum read-depth per amplicon as integer value. [default=200]
--incl_prim Set this flag if you want to include the primer sequences in the trimmed reads. By default primers are removed together with all overhanging sequences like
barcodes and adapters.
-s SEQ_TEC Specify long-read sequencing technology (ont/pb). [default=ont]
-n NAME_SCHEME Provide path to json-file containing a naming scheme which is consistently used for all primers.[default=artic_nCoV_scheme_v5.3.2]
--set_min_len SET_MIN_LEN
Set a minimum required length for alignments of reads to amplicons. If this is not set the min_len will be 0.8*shortest_amp_len. When using reads that are
shorter than amplicon sizes use this argument to adjust. For long reads this option is usually not required.
--set_max_len SET_MAX_LEN
Set a maximum allowed length for alignments of reads to amplicon. If this is not set the max_len will be 1.2*longest_amp_len. The default setting normally
doesn't need to be changed.
--set_margins MARGINS
Set length of tolerance margins for sorting of mappings to amplicons. [default=5]
--figures [FIGURES] Set to generate figures of input and output read_counts. Available for --all and --cov modes. You can optionally provide a value to draw a red helper line in the
output read plot, showing a threshold, e.g. min. required reads. [default=False; default_threshold=20]
--fastq Set this flag to request output in fastq format. By default output is in fasta format. Has no effect if input file is fasta.
--split Set this flag to request output of capped, untrimmed reads split to amplicon specific files (can be a lot).
--selftest Run a selftest of NoTrAmp using included test-data. Overrides all other arguments and parameters. Useful for checking how NoTrAmp runs in your environment.
-v, --version Print version and exit
NoTrAmp by default generates a separate read file as output for capping and trimming. Capped untrimmed reads are contained in a file ending on ".cap.fasta".
Clipped reads are stored in a file ending on ".clip.fasta". If both capping and trimming were selected, trimmed versions of the capped reads, are written to "YourFileName.cap.clip.fasta". If quality information is required downstream
in your workflow, you can request output in fastq format, by setting the --fastq flag. It is recommended that quality control and
filtering of data is performed before running NoTramp.
Additionaly a log-file ("notramp.log") is generated, that also contains detailed information about processed and selected reads, read coverage/amplicon and trimmed bases.
A visual representation (see below) of input and output reads can also be requested by setting the --figures flag.
NoTramp requires primers in multiplex amplicon tiling panels to follow a consistent scheme. A primer name must consist of a number of fields delineated by a defined separator. Two fields are mandatory: amplicon-number and primer-position(FW/REV). NoTramp offers a high degree of flexibility for naming of primers by allowing the user to specify their own naming scheme. This is provided as json file, which defines the information contained in each field of the primer name. The convention specified in this naming scheme must be followed consistently throughout the entire panel.
Minimal primer naming scheme containing all required key/value pairs (see also "notramp/resources/minimal_scheme.json"):
{
"sep": "_",
"min_len": 2,
"max_len": 2,
"amp_num": 0,
"position": 1,
"fw_indicator": "fw",
"rev_indicator": "rev"
}
"Keyword" | "Description" |
---|---|
"sep" | Separator used to delineate fields in primer name |
"min_len" | minimal number of fields in primer names [must be an int] |
"max_len" | maximum number of fields in primer names (if no alternative primer are in the panel, the same as min_len) [must be an int] |
"amp_num" | 0 based index of the field containing the amplicon-number [must be an int] |
"position" | 0 based index of the field containing information on primer position in the amplicon (e.g. left/right or fw/rev) [must be an int] |
"fw_indicator" | indicator used to identify directionality of the primer; can be anything as long as consistent; typical indicators: "fw", "FW", "left", "start", "+" |
"rev_indicator" | indicator used to identify directionality of the primer; can be anything as long as consistent; typical indicators: "rev", "REV", "right", "end", "-" |
Examples for primers named after a minimal scheme:
1_fw
1_rev
2_fw
2_rev
3_fw
3_rev
...
Use of such naming schemes enforces primers to consistently have names with the same number of fields, which always carry the same kind of information. However, there can be one exception from the same number of fields rule.
Sometimes alternative primer pairs for the "same" amplicon are used to boost underperforming amplicons. These are typically primer sequences that are shifted just a couple of bases to the left or right of the original/primary primer pair, in the hope of increasing the yield for this region. The products generated under involvement of such alternative primers cover the same region of the target sequence, with only slight deviations on the rims/fringes/edges/corners, in the parts overlapping with neighbouring amplicons. Alternative primers are required to carry the same amplicon-number as primary ones, but must have some indicator to be distinguished from those. NoTramp accounts for this irregularity by allowing for alternative primers to be labeled with an "alt" field, which must be supplied as a postfix. Therefore in panels containing alternative primers, max_len must min_len + 1. If your panel includes alternative primers, you have to add the "alt" keyword to your naming scheme, with its corresponding index as value, which must be the last possible field (max_len-1). The alt indicator you use in the name can be anything you want (e.g. "v2", "2", "b", "alt" etc.), but the keyword in the json dict must be "alt".
In addition to the optional "alt" field, any number of custom fields can be added to the naming scheme, if your primer contains additionals field. These could typically be a common name or pool indicator. These fields can be added for completeness, but don't have an effect on the inner workings of NoTramp.
Generic primer scheme (see also "notramp/resources/generic_scheme.json"):
{
"sep": "_",
"min_len": 3,
"max_len": 4,
"root_name": 0,
"amp_num": 1,
"position": 2,
"alt": 3,
"fw_indicator": "FW",
"rev_indicator": "REV"
}
"Keyword" | "Description" |
---|---|
"alt" | 0 based index of the field containing the alternative primer indicator |
"root_name" | custom field used for a common name |
Examples for primers complying to the generic scheme above:
Target-Gene_1_FW
Target-Gene_1_REV
Target-Gene_2_FW
Target-Gene_2_REV
Target-Gene_2_FW_v2
Target-Gene_2_REV_v2
Target-Gene_3_FW
Target-Gene_3_REV
...
If no custom scheme is supplied, the current default is the Artic SARS-CoV2 v5.3.2 primer scheme: Generic primer scheme (see also "notramp/resources/artic_nCoV_scheme_v5.json"):
{
"sep": "_",
"min_len": 4,
"max_len": 5,
"root_name": 0,
"amp_scale": 1,
"amp_num": 2,
"position": 3,
"iteration": 4,
"alt": 5,
"fw_indicator": "LEFT",
"rev_indicator": "RIGHT"
}
required:
- Python >= 3.7
- minimap2
recommended:
- psutil
optional:
- matplotlib (for figures if desired)