`v-annotate.pl` example usage, command-line options and alert information

v-annotate.pl example usage
- example annotation of norovirus sequences
- example of using --alt_pass to change alerts from fatal to non-fatal
v-annotate.pl command-line options
Basic Information on v-annotate.pl alerts
Additional information on v-annotate.pl alerts
Expendable features: allowing sequences to pass despite fatal alerts for specific features
Limiting memory usage and multi-threading
Alternative parallelization using a cluster

`v-annotate.pl` example usage

v-annotate.pl uses previously created VADR models from v-build.pl to analyze and annotate sequences in an input sequence file. As part of the analysis of the sequences, more than 40 types of unexpected characteristics, or alerts are detected and reported in the output. Most of the alerts are fatal in that if a sequence has one or more fatal alerts, they will be designated as failing sequences. Sequences with zero fatal alerts are designated as passing sequences. The types of alerts are described further below.

NOTE: the examples below are for norovirus and demonstrate the typical usage of vadr. For examples specific to SARS-CoV-2 see:

https://github.com/ncbi/vadr/wiki/Coronavirus-annotation

To determine the command-line usage of v-annotate.pl (or any VADR script), use the -h option, like this:

v-annotate.pl -h

You'll see something like the following output:

# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.4 (Dec 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:    Thu Dec 16 09:38:22 2021
#
Usage: v-annotate.pl [-options] <fasta file to annotate> <output directory to create>

The first few lines are the banner which show the name of the VADR script being run along with the version and release date. This is followed by the time and date the command was executed. The Usage: line details the expected command line arguments. v-annotate.pl takes as input two command line arguments, a fasta file with sequences to analyze and annotate (<fasta file to annotate>) and and the name of the output directory you want it to create (<output directory to create>) and populate with output files.

After that comes a list of all available command-line options. These are explained in more detail below.

Here is an example v-annotate.pl command using the fasta file vadr/documentation/annotate-files/noro.9.fa with 9 norovirus sequences, and creating an output directory with va-noro.9:

v-annotate.pl $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-noro.9

The standard output of v-annotate.pl that is printed to the screen (which is also output to the .log output file) begins with the banner and date again followed by a list of relevant environment variables, the command line arguments used and any command line options used:

# date:              Thu Dec 16 09:38:53 2021
# $VADRBIOEASELDIR:  /home/nawrocki/vadr-install-dir/Bio-Easel
# $VADRBLASTDIR:     /home/nawrocki/vadr-install-dir/ncbi-blast/bin
# $VADREASELDIR:     /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRINFERNALDIR:  /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRMODELDIR:     /home/nawrocki/vadr-install-dir/vadr-models-calici
# $VADRSCRIPTSDIR:   /home/nawrocki/vadr-install-dir/vadr
#
# sequence file:     /home/nawrocki/vadr-install-dir/vadr/documentation/annotate-files/noro.9.fa
# output directory:  va-noro.9

No command line options were used in our example output, but if they were information on them would have appeared after the output directory line.

Next, information is output about each step the script is proceeding through. When each step is completed, the elapsed time for that step is output.

v-annotate.pl will use the default VADR model library (CM file $VADRMODELDIR/vadr.cm, model info file $VADRMODELDIR/vadr.minfo and BLAST DBs in $VADRMODELDIR) to analyze the sequences in noro.9.fa, and will create an output directory named va-noro.9 and populate it with many output files.

There are four main stages to v-annotate.pl

Classification: each sequence S in the input file is compared against all models in the model library to determine the best-matching (highest-scoring) model M(S) for each sequence.
Coverage determination: each sequence S is compared again to M(S) to determine the coverage of S, which is basically the fraction of nucleotides in S that appear to be homologous (sufficiently similar) to M(S).
Alignment and feature mapping: each sequence S is aligned to M(S) and based on that alignment, features of M(S) are mapped onto S.
Protein validation of CDS features: for each sequence S that has 1 or more predicted CDS features, blastx or hmmsearch is used to compare the predicted CDS and the full sequence S against the VADR library BLAST or HMM DB.

The output of v-annotate.pl lists one or more steps per stage. The first two steps are:

# Validating input                                                                        ... done. [    0.7 seconds]
# Classifying sequences (9 seqs)                                                          ... done. [   31.4 seconds]

The first step validates that the VADR library .minfo file being used corresponds to the VADR library .cm file. Next, the classification stage is performed. After that, each model that is M(S) for at least one S is run separately through the coverage determination stage for all of its sequences:

# Determining sequence coverage (NC_001959: 1 seq)                                        ... done. [    1.0 seconds]
# Determining sequence coverage (NC_008311: 2 seqs)                                       ... done. [    2.9 seconds]
# Determining sequence coverage (NC_029645: 2 seqs)                                       ... done. [    1.1 seconds]
# Determining sequence coverage (NC_039477: 2 seqs)                                       ... done. [    3.0 seconds]
# Determining sequence coverage (NC_044854: 2 seqs)                                       ... done. [    0.9 seconds]

Next, the alignments are performed for each model, and used to map feature annotation:

# Aligning sequences (NC_001959: 1 seq)                                                   ... done. [    0.8 seconds]
# Aligning sequences (NC_008311: 2 seqs)                                                  ... done. [   10.9 seconds]
# Aligning sequences (NC_029645: 2 seqs)                                                  ... done. [    1.5 seconds]
# Aligning sequences (NC_039477: 2 seqs)                                                  ... done. [   11.6 seconds]
# Aligning sequences (NC_044854: 2 seqs)                                                  ... done. [    0.9 seconds]
# Determining annotation                                                                  ... done. [    0.6 seconds]

The classification and alignment stages are typically the slowest. The protein validation stage is usually relatively fast:

# Validating proteins with blastx (NC_001959: 1 seq)                                      ... done. [    2.0 seconds]
# Validating proteins with blastx (NC_008311: 2 seqs)                                     ... done. [    1.1 seconds]
# Validating proteins with blastx (NC_029645: 2 seqs)                                     ... done. [    0.8 seconds]
# Validating proteins with blastx (NC_039477: 2 seqs)                                     ... done. [    1.0 seconds]
# Validating proteins with blastx (NC_044854: 2 seqs)                                     ... done. [    0.7 seconds]

The only remaining steps are to create the output files:

# Generating feature table output                                                         ... done. [    0.0 seconds]
# Generating tabular output                                                               ... done. [    0.0 seconds]

After the output files are generated, a summary of the results is output, starting with a list of all models that were the best matching model to one or more sequences:

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     NC_008311  Norovirus  GV           2     1     1
2     NC_029645  Norovirus  GIII         2     2     0
3     NC_039477  Norovirus  GII          2     2     0
4     NC_044854  Norovirus  GI           2     2     0
5     NC_001959  Norovirus  GI           1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            9     8     1
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----

The above table is also saved to a file with the suffix .mdl, and the format is described in more detail here.

The nine input sequences collectively had five different best-matching models, all Norovirus models with the subgroups shown above. (The groups and subgroups shown above were input as command-line options to v-build.pl when the models were built. An example is here. Eight of the nine sequences passed and one failed.

Next, the alerts reported for the sequences are summarized:

# Summary of reported alerts:
#
#     alert     causes   short                            per    num   num  long
#idx  code      failure  description                     type  cases  seqs  description
#---  --------  -------  ---------------------------  -------  -----  ----  -----------
1     mutendcd  yes      MUTATION_AT_END              feature      1     1  expected stop codon could not be identified, predicted CDS stop by homology is invalid
2     cdsstopn  yes      CDS_HAS_STOP_CODON           feature      1     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
3     cdsstopp  yes      CDS_HAS_STOP_CODON           feature      1     1  stop codon in protein-based alignment
4     indf5pst  yes      INDEFINITE_ANNOTATION_START  feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
5     indf3pst  yes      INDEFINITE_ANNOTATION_END    feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#---  --------  -------  ---------------------------  -------  -----  ----  -----------

For the nine sequences, there were five different alert codes reported, each with one case for a single sequence. This tabular summary does not indicate to which sequence(s) each alert pertains. We can find that out from other output files as discussed further below.

Next, the list of output files created by v-annotate.pl is printed, along with elapsed time:

# Output printed to screen saved in:                              va-noro.9.vadr.log
# List of executed commands saved in:                             va-noro.9.vadr.cmd
# List and description of all output files saved in:              va-noro.9.vadr.filelist
# esl-seqstat -a output for input fasta file saved in:            va-noro.9.vadr.seqstat
# 5 column feature table output for passing sequences saved in:   va-noro.9.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:   va-noro.9.vadr.fail.tbl
# list of passing sequences saved in:                             va-noro.9.vadr.pass.list
# list of failing sequences saved in:                             va-noro.9.vadr.fail.list
# list of alerts in the feature tables saved in:                  va-noro.9.vadr.alt.list
# fasta file with passing sequences saved in:                     va-noro.9.vadr.pass.fa
# fasta file with failing sequences saved in:                     va-noro.9.vadr.fail.fa
# per-sequence tabular annotation summary file saved in:          va-noro.9.vadr.sqa
# per-sequence tabular classification summary file saved in:      va-noro.9.vadr.sqc
# per-feature tabular summary file saved in:                      va-noro.9.vadr.ftr
# per-model-segment tabular summary file saved in:                va-noro.9.vadr.sgm
# per-model tabular summary file saved in:                        va-noro.9.vadr.mdl
# per-alert tabular summary file saved in:                        va-noro.9.vadr.alt
# alert count tabular summary file saved in:                      va-noro.9.vadr.alc
# alignment doctoring tabular summary file saved in:              va-noro.9.vadr.dcr
#
# All output files created in directory ./va-noro.9/
#
# Elapsed time:  00:01:14.88
#                hh:mm:ss
# 
[ok]

All of these files were created in the newly created directory va-noro.9. What follows is a brief discussion of some of these output file types. More information on these files and their formats can be found here.

The first three files are the .log file, which is the same as the standard output printed to the screen currently being discussed, the .cmd file, and the .filelist file which lists the output files created by v-annotate.pl. Next comes a .seqstat file with lengths for each sequence in the input file.

v-annotate.pl also creates 5-column tab-delimited feature table files that end with the suffix .tbl. There is a separate file for passing (va-noro9.vadr.pass.tbl) and failing (va-noro9.vadr.fail.tbl) sequences. The format of the .tbl files is described here: https://www.ncbi.nlm.nih.gov/Sequin/table.html. These files contain some of the same information as the .ftr files discussed above, with some additional information on qualifiers for features read from the model info file. From the va-noro9.vadr.pass.tbl file, the feature table for KY887602 is:

>Feature KY887602.1
<1	5083	gene
			gene	ORF1
<1	5083	CDS
			product	nonstructural polyprotein
			codon_start	2
			protein_id	KY887602.1_1
<1	979	mat_peptide
			product	p48
			protein_id	KY887602.1_1
980	2077	mat_peptide
			product	NTPase
			protein_id	KY887602.1_1
2078	2608	mat_peptide
			product	p22
			protein_id	KY887602.1_1
2609	3007	mat_peptide
			product	VPg
			protein_id	KY887602.1_1
3008	3550	mat_peptide
			product	Pro
			protein_id	KY887602.1_1
3551	5080	mat_peptide
			product	RdRp
			protein_id	KY887602.1_1
5064	6686	gene
			gene	ORF2
5064	6686	CDS
			product	VP1
			protein_id	KY887602.1_2
6686	7492	gene
			gene	ORF3
6686	7492	CDS
			product	VP2
			protein_id	KY887602.1_3

In the .fail.tbl file, each sequence's feature table contains a special section at the end of the table that lists errors due to reported fatal alerts. For example, at the end of va-noro9.vadr.fail.tbl:

Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:VF1) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]; seq-coords:5275..5277:+; mdl-coords:5300..5302:+; mdl:NC_008311;
ERROR: CDS_HAS_STOP_CODON: (CDS:VF1) stop codon in protein-based alignment [-]; seq-coords:5275..5277:+; mdl-coords:5300..5302:+; mdl:NC_008311;
ERROR: INDEFINITE_ANNOTATION_END: (CDS:VF1) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]; seq-coords:5650..5685:+; mdl-coords:5710..5710:+; mdl:NC_008311;
ERROR: INDEFINITE_ANNOTATION_START: (CDS:VP2) protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]; seq-coords:6656..6709:+; mdl-coords:6681..6681:+; mdl:NC_008311;

Note that this file lists only four ERRORs while the .alt output file below lists five alerts. Not all fatal alerts will be printed to this .fail.tbl file, because when specific pairs of alerts occur, only one is output to reduce the number of overlapping or redundant problems reported to the submitter/user. In this case the mutendcd (MUTATION_AT_END) alert is omitted in the .fail.tbl file because it occurs in combination with a cdsstopn (CDS_HAS_STOP_CODON) alert. But this is rare, as only one alert (mutendcd) can possibly be omitted. See the far right column of the this table for which alerts, when present in combination with mutendcd causes it to be omitted. Only alerts output to the .fail.tbl table are output to the .alt.list file.

The next three output files all end in .list. The .pass.list and .fail.list files are simply lists of all passing and failing sequences, respectively, with each sequence listed on a separate line. The .alt.list file lists all alerts that cause errors in the .fail.tbl file in a four field tab-delimited format described here. The va-noro9.vadr.alt.list file lists the four alerts/errors in the .fail.tbl file shown above:

#sequence	model	feature-type	feature-name	error	seq-coords	mdl-coords	error-description
JN975492.1	NC_008311	CDS	VF1	CDS_HAS_STOP_CODON	5275..5277:+	5300..5302:+	in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]
JN975492.1	NC_008311	CDS	VF1	CDS_HAS_STOP_CODON	5275..5277:+	5300..5302:+	stop codon in protein-based alignment [-]
JN975492.1	NC_008311	CDS	VF1	INDEFINITE_ANNOTATION_END	5650..5685:+	5710..5710:+	protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]
JN975492.1	NC_008311	CDS	VP2	INDEFINITE_ANNOTATION_START	6656..6709:+	6681..6681:+	protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]

After that are two FASTA-formatted sequence files. One of these files includes all passing sequences (va-noro.9.vadr.pass.fa) and the other includes all failing sequences (va-noro.9.vadr.fail.fa).

After these two FASTA files are eight tabular summary files that end with three letter suffixes:

suffix	description	reference
`.alc`	per-alert code information (counts)	description of format
`.alt`	per-alert instance information	description of format
`.ftr`	per-feature information	description of format
`.mdl`	per-model information	description of format
`.sgm`	per-segment information	description of format
`.sqa`	per-sequence annotation information	description of format
`.sqc`	per-sequence classification information	description of format
`.dcr`	alignment doctoring information	description of format

The contents of the .mdl and .alc files were already output by v-annotate.pl as covered above. To get more information on each sequence, see the .sqa and .sqc files. The .sqc file (va-noro.9.vadr.sqc) includes information on the classification of each sequence:

#seq  seq          seq                                   sub                    seq    mdl         num                             sub     score  diff/  seq   
#idx  name         len  p/f   ant  model1     grp1       grp1   score  sc/nt    cov    cov  bias  hits  str  model2     grp2       grp2     diff     nt  alerts
#---  ----------  ----  ----  ---  ---------  ---------  ----  ------  -----  -----  -----  ----  ----  ---  ---------  ---------  -----  ------  -----  ------
1     KY887602.1  7547  PASS  yes  NC_039477  Norovirus  GII   8142.8  1.079  1.000  0.997  12.5     1    +  NC_044046  Norovirus  GVIII  4348.2  0.576  -     
2     KT818729.1   243  PASS  yes  NC_044854  Norovirus  GI     170.4  0.701  0.996  0.031     0     1    +  NC_044932  Norovirus  GII      90.0  0.370  -     
3     EU437710.1   291  PASS  yes  NC_001959  Norovirus  GI     249.4  0.857  1.000  0.038     0     1    +  NC_044855  Norovirus  GIV     161.5  0.555  -     
4     DQ288307.1  1094  PASS  yes  NC_029645  Norovirus  GIII   973.4  0.890  1.000  0.150   6.2     1    +  NC_001959  Norovirus  GI      671.1  0.613  -     
5     AY237464.1   255  PASS  yes  NC_039477  Norovirus  GII    221.1  0.867  1.000  0.034     0     1    +  NC_044047  Norovirus  GVII    113.5  0.445  -     
6     KF475958.1   275  PASS  yes  NC_044854  Norovirus  GI     248.0  0.902  1.000  0.036     0     1    +  NC_044932  Norovirus  GII     164.0  0.596  -     
7     AB713840.1   347  PASS  yes  NC_008311  Norovirus  GV     330.6  0.953  1.000  0.047   0.2     1    +  NC_040876  Norovirus  GII     239.5  0.690  -     
8     JN585032.1   286  PASS  yes  NC_029645  Norovirus  GIII   242.3  0.847  0.997  0.039   0.1     1    +  NC_039897  Norovirus  GI      154.8  0.541  -     
9     JN975492.1  7286  FAIL  yes  NC_008311  Norovirus  GV    4666.2  0.640  1.000  0.987  16.8     1    +  NC_044047  Norovirus  GVII   3382.8  0.464  -

This file includes per-sequence information on whether each sequence passed or failed, its best and second-best matching model, and scores and coverage. The difference in score between the best and second-best model gives an indication of how confidently classified the sequence is to the best-matching model. (The qstgroup and qstsbgrp alerts are reported if these scores are not sufficiently far apart to alert the user that a sequence is not confidently classified.) Note that the sequence JN975492.1 is the only sequence that failed. It has no seq alerts listed in the final field, so the fatal alerts that caused it to fail must have been per-feature alerts. These can be seen in the .ftr and .alt files. An explanation of all fields in the .sqc file type is here.

Next, take a look at the first few lines of the .ftr file (va-noro.9.vadr.ftr):

#     seq          seq                   ftr          ftr                         ftr  ftr  par                                                                                                        seq         model  ftr   
#idx  name         len  p/f   model      type         name                        len  idx  idx  str  n_from  n_to  n_instp  trc    5'N  3'N  p_from  p_to           p_instp  p_sc  nsa  nsn        coords        coords  alerts
#---  ----------  ----  ----  ---------  -----------  -------------------------  ----  ---  ---  ---  ------  ----  -------  -----  ---  ---  ------  ----  ----------------  ----  ---  ---  ------------  ------------  ------
1.1   KY887602.1  7547  PASS  NC_039477  gene         ORF1                       5083    1   -1    +       1  5083        -  5'       0    0       -     -                 -     -    1    0     1..5083:+    22..5104:+  -     
1.2   KY887602.1  7547  PASS  NC_039477  CDS          nonstructural_polyprotein  5083    2   -1    +       1  5083        -  5'       0    0       2  5080                 -  9094    1    0     1..5083:+    22..5104:+  -     
1.3   KY887602.1  7547  PASS  NC_039477  gene         ORF2                       1623    3   -1    +    5064  6686        -  no       0    0       -     -                 -     -    1    0  5064..6686:+  5085..6707:+  -     
1.4   KY887602.1  7547  PASS  NC_039477  CDS          VP1                        1623    4   -1    +    5064  6686        -  no       0    0    5064  6683                 -  2878    1    0  5064..6686:+  5085..6707:+  -     
1.5   KY887602.1  7547  PASS  NC_039477  gene         ORF3                        807    5   -1    +    6686  7492        -  no       0    0       -     -                 -     -    1    0  6686..7492:+  6707..7513:+  -     
1.6   KY887602.1  7547  PASS  NC_039477  CDS          VP2                         807    6   -1    +    6686  7492        -  no       0    0    6686  7486                 -  1393    1    0  6686..7492:+  6707..7513:+  -     
1.7   KY887602.1  7547  PASS  NC_039477  mat_peptide  p48                         979    7    2    +       1   979        -  5'       0    0       -     -                 -     -    1    0      1..979:+    22..1000:+  -     
1.8   KY887602.1  7547  PASS  NC_039477  mat_peptide  NTPase                     1098    8    2    +     980  2077        -  no       0    0       -     -                 -     -    1    0   980..2077:+  1001..2098:+  -     
1.9   KY887602.1  7547  PASS  NC_039477  mat_peptide  p22                         531    9    2    +    2078  2608        -  no       0    0       -     -                 -     -    1    0  2078..2608:+  2099..2629:+  -     
1.10  KY887602.1  7547  PASS  NC_039477  mat_peptide  VPg                         399   10    2    +    2609  3007        -  no       0    0       -     -                 -     -    1    0  2609..3007:+  2630..3028:+  -     
1.11  KY887602.1  7547  PASS  NC_039477  mat_peptide  Pro                         543   11    2    +    3008  3550        -  no       0    0       -     -                 -     -    1    0  3008..3550:+  3029..3571:+  -     
1.12  KY887602.1  7547  PASS  NC_039477  mat_peptide  RdRp                       1530   12    2    +    3551  5080        -  no       0    0       -     -                 -     -    1    0  3551..5080:+  3572..5101:+  -     
#

This file includes information on each annotated feature, organized by sequence. The lines above show the annotated features for the first sequence KY887602.1 and include information on the feature length, strand, positions in the sequence and in the reference model, whether it is truncated or not, and where the blastx protein validation stage predicted coordinates are. The seq coords and mdl coords lines near the end show the sequence and model coordinates of each feature in the VADR coordinate string format, described here. The final column lists any alerts pertaining to each feature. For the features in this sequence there are no alerts, but for the final sequence, some alerts are listed for the second CDS. An explanation of all fields in the .ftr file type is here.

Another important file is the .alt output file (va-noro9.vadr.alt) which includes one line per alert reported:

#      seq                    ftr   ftr   ftr  alert           alert                                 seq  seq           mdl  mdl  alert 
#idx   name        model      type  name  idx  code      fail  description                        coords  len        coords  len  detail
#----  ----------  ---------  ----  ----  ---  --------  ----  ---------------------------  ------------  ---  ------------  ---  ------
9.1.1  JN975492.1  NC_008311  CDS   VF1     6  mutendcd  yes   MUTATION_AT_END              5683..5685:+    3  5708..5710:+    3  expected stop codon could not be identified, predicted CDS stop by homology is invalid [TCA]
9.1.2  JN975492.1  NC_008311  CDS   VF1     6  cdsstopn  yes   CDS_HAS_STOP_CODON           5275..5277:+    3  5300..5302:+    3  in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]
9.1.3  JN975492.1  NC_008311  CDS   VF1     6  cdsstopp  yes   CDS_HAS_STOP_CODON           5275..5277:+    3  5300..5302:+    3  stop codon in protein-based alignment [-]
9.1.4  JN975492.1  NC_008311  CDS   VF1     6  indf3pst  yes   INDEFINITE_ANNOTATION_END    5650..5685:+   36  5710..5710:+    1  protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]
9.2.1  JN975492.1  NC_008311  CDS   VP2     8  indf5pst  yes   INDEFINITE_ANNOTATION_START  6656..6709:+   54  6681..6681:+    1  protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]

All alerts are for the JN975492.1 sequence. Four are for the VF1 CDS and 1 is for the VP2 CDS. The alert codes are listed in the seventh column, along with a brief description in the eigth column. Then the sequence and model coordinates pertaining to the alert and the lengths of those regions are listed in columns 9 to 12. A more detailed description of the problem can be found in the final column. All possible alerts are listed in the alert table. For some examples of different types of alerts see here

Example of using the `v-annotate.pl` `--alt_pass` and `--alt_fail` to change alerts from fatal to non-fatal and vice versa

One way to change the behavior of v-annotate.pl is to change which alerts are fatal or non-fatal. Most alerts are fatal by default, but some are not, as shown in the alert table. Some alerts are always fatal in that they cannot be changed, but all others can be toggled between fatal or non-fatal using the --alt_pass and --alt_fail options.

For example, we can make the JN976492.1 sequence above pass by making the five observed alerts (mutendcd, cdsstopn, cdsstopp, indf3pst, and indf5pst) by rerunning the v-annotate.pl above with this command:

v-annotate.pl --alt_pass mutendcd,cdsstopn,cdsstopp,indf3pst,indf5pst $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-pass-noro.9

To supply multiple alerts with --alt_pass or --alt_fail, separate them by a , without any whitespace, like above.

The output will look very similar to the earlier run, but the summary information printed at the end will show that no sequences fail this time, despite the same alerts being reported.

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     NC_008311  Norovirus  GV           2     2     0
2     NC_039477  Norovirus  GII          2     2     0
3     NC_044854  Norovirus  GI           2     2     0
4     NC_029645  Norovirus  GIII         2     2     0
5     NC_001959  Norovirus  GI           1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            9     9     0
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                            per    num   num  long
#idx  code      failure  description                     type  cases  seqs  description
#---  --------  -------  ---------------------------  -------  -----  ----  -----------
1     mutendcd  no       MUTATION_AT_END              feature      1     1  expected stop codon could not be identified, predicted CDS stop by homology is invalid
2     cdsstopn  no       CDS_HAS_STOP_CODON           feature      1     1  in-frame stop codon exists 5' of stop position predicted by homology to reference
3     cdsstopp  no       CDS_HAS_STOP_CODON           feature      1     1  stop codon in protein-based alignment
4     indf5pst  no       INDEFINITE_ANNOTATION_START  feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
5     indf3pst  no       INDEFINITE_ANNOTATION_END    feature      1     1  protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#---  --------  -------  ---------------------------  -------  -----  ----  -----------

The --alt_fail <s> option works the same way as alt_pass <s> but alert codes in <s> should be non-fatal by default.

Alternatively, if you want to relax the stringency of alerts only for specific features you can do that by modifying the modelinfo input file as explained below.

Example of using the `-p` option to parallelize `v-annotate.pl`

The most time-consuming stages of v-annotate.pl (classification, coverage determination and alignment) can be parallelized on a cluster by splitting up the input sequence file randomly into multiple files, and running each as a separate job. This is most beneficial for large input sequence files. Parallel mode is invoked with the -p option. By default, v-annotate.pl will consult the file $VADRSCRIPTSDIR/vadr.qsubinfo to read the command prefix and suffix for submitting jobs to the cluster. This file is set up to use Univa Grid Engine (UGE 8.5.5) and specific flags used on the NCBI system, but you can either modify this file to work with your own cluster or create a new file <s> and use the option -q <s> to read that file. The $VADRSCRIPTSDIR/vadr.qsubinfo has comments at the top that explain the format of the file. Email eric.nawrocki@nih.gov for help.

To repeat the above v-annotate.pl run in parallel mode, use this command:

v-annotate.pl -p $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-parallel-noro.9

The output will look very similar to the run without -p, but with additional lines of output explaining that jobs have been submitted and are running on the compute farm:

# Submitting 1 cmscan classification job(s) to the farm                               ... 
# Waiting a maximum of 500 minutes for all farm jobs to finish                        ... 
#	   0 of    1 jobs finished (0.2 minutes spent waiting)
#	   0 of    1 jobs finished (0.5 minutes spent waiting)
#	   0 of    1 jobs finished (0.8 minutes spent waiting)
#	   0 of    1 jobs finished (1.0 minutes spent waiting)
#	   1 of    1 jobs finished (1.2 minutes spent waiting)
# done. [   75.7 seconds]
# Submitting 1 cmsearch coverage determination job(s) (NC_001959: 1 seqs) to the farm ... 
# Waiting a maximum of 500 minutes for all farm jobs to finish                        ... 
#	   1 of    1 jobs finished (0.2 minutes spent waiting)
# done. [   15.2 seconds]

Usage of -p will not affect the output of v-annotate.pl other than these lines about the status of jobs, but it can make processing of large sequence files significantly faster depending on how busy the cluster is.

`v-annotate.pl` command-line options

To get a list of command-line options, execute:

v-annotate.pl -h

This will output the usage and available command-line options. Each option has a short description, but additional information on some of these options can be found below. For v-annotate.pl the available options are split into nine different categories, each explained in their own subsection below.

In the tables describing options below, <s> represents a string, <x> indicates a floating point number and <n> represents an integer.

`v-annotate.pl` basic options

......option....	explanation
`-f`	if `<output directory>` already exists, then using this option will cause it to be overwritten, otherwise the progam exits in error
`-v`	verbose mode: all commands will be output to standard output as they are run
`--atgonly`	only consider ATG as a valid start codon, regardless of model's translation table
`--minpvlen <n>`	set the minimum length in nucleotides for CDS/mat_peptide/gene features to be output to feature tables and for protein validation analysis to `<n>`, default `<n>` is 30
`--nkb <n>`	set the target number of Kb of sequence for each alignment job and/or chunk (with --split) to `<n>` Kb (thousand nucleotides), default `<n>` is `300`
`--keep`	keep additional output files that are normally removed

`v-annotate.pl` options for specifying expected sequence classification

..........option..........	explanation
`--group <s>`	specify that the expected classification of all sequences is group `<s>`, sequences determined to not be in this group will trigger an incgroup alert
`--subgroup <s2>`	specify that the expected classification of all sequences is subgroup `<s>` within group `<s2>` from `--group <s2>`, sequences determined to not be in this group will trigger an incsubgrp alert; requires `--group`

`v-annotate.pl` options for controlling which alerts are fatal and cause a sequence to FAIL

............option............	explanation
`--alt_list`	output summary of all alerts and then exit
`--alt_pass <s>`	specify that alert codes in comma-separated string `<s>` are non-fatal (do not cause a sequence to fail), all alert codes listed must be fatal by default
`--alt_fail <s>`	specify that alert codes in comma-separated string `<s>` are fatal (cause a sequence to fail), all alert codes listed must be non-fatal by default
`--alt_mnf_yes <s>`	specify that alert codes in comma-separated string `<s>` for 'misc_not_failure' features cause misc_feature-ization, not failure as explained more here
`--alt_mnf_no <s>`	specify that alert codes in comma-separated string `<s>` for 'misc_not_failure' features cause failure, not misc-feature-ization as explained more here

`v-annotate.pl` options for ignoring specific keys in the input model info (.minfo) file

............option............	explanation
`--ignore_mnf`	ignore non-zero 'misc_not_feature' values in `modelinfo` file, set to 0 for all features/models
`--ignore_isdel`	ignore non-zero 'is_deletable' values in `modelinfo` file, set to 0 for all features/models
`--ignore_afset`	ignore non-zero 'alternative_ftr_set' and 'alternative_ftr_set_subn' values in `modelinfo` file
`--ignore_afsetsubn`	ignore non-zero 'alternative_ftr_set_subn' values in `modelinfo` file

`v-annotate.pl` options related to model files

.......option.......	explanation
`-m <s>`	use the CM file `<s>`, instead of the default CM file ($VADRMODELDIR/vadr.cm)
`-a <s>`	use HMM file `<s>` instead of the default HMM file ($VADRMODELDIR/vadr.hmm)
`-i <s>`	use the VADR model info file `<s>`, instead of the default model info file ($VADRMODELDIR/vadr.minfo)
`-n <s>`	use the blastn DB file `<s>` when necessary, instead of the default blastn DB file ($VADRMODELDIR/vadr.fa), only used if `-s` or `-r` is also used
`-x <s>`	specify that the blastx database files to use for protein validation are in dir `<s>`, instead of the default directory ($VADRMODELDIR)
`--mkey <s>`	specify that .cm, .minfo, and blastn .fa files in $VADRMODELDIR start with key `<s>`, not 'vadr'
`--mdir <s>`	specify that all model files to use are in the directory `<s>`, not in $VADRMODELDIR
`--mlist <s>`	specify that only the subset of models listed in the file `<s>` be used

`v-annotate.pl` options for controlling output feature table

.......option.......	explanation
`--nomisc`	in feature table, never change feature to `misc_feature`
`--notrim`	in feature table, do not trim coordinate start and stops due to Ns at beginning or end of features for all feature types
`--noftrtrim <s>`	in feature table, do not trim coordinate start and stops due to Ns at beginning or end of features for feature types listed in the comma-delimited string `<s>` (no spaces)
`--noprotid`	in feature table, don't add protein_id for CDS and mat_peptide features
`--forceprotid`	in feature table, force protein_id value to be sequence name, then idx

`v-annotate.pl` options for controlling thresholds related to alerts

In the table below, <n> represents a positive interger argument and <x> represents a positive floating-point argument.

...........option...........	relevant alert code(s)	relevant error(s)	default value that triggers alert	explanation
`--lowsc <x>`	lowscore	LOW_SCORE	< 0.3	set bits per nt threshold for alert to `<x>`
`--indefclass <x>`	indfclas	INDEFINITE_CLASSIFICATION	< 0.03	set bits per nt difference threshold for alert between top two models (not in same subgroup) to `<x>`
`--incspec <x>`	incgroup, incsubgrp	INCORRECT_SPECIFIED_GROUP, INCORRECT_SPECIFIED_SUBGROUP	< 0.2	set bits per nt difference threshold for alert between best-matching model `<m>` and highest-scoring model in specified group `<s1>` (from `--group <s1>`) or subgroup `<s2>` (from `--subgroup <s2>`), where `<m>` is not in group/subgroup `<s1>`/`<s2>` to `<x>`
`--lowcov <x>`	lowcovrg	LOW_COVERAGE	< 0.9	set fractional coverage threshold for alert to `<x>`
`--dupregolp <n>`	dupregin	DUPLICATE_REGIONS	>= 20	set min number of model position overlap for alert to `<n>` positions
`--dupregsc <x>`	dupregin	DUPLICATE_REGIONS	>= 10.0	set min bit score of weaker overlapping hit to `<x>` bits
`--indefstr <x>`	indfstrn	INDEFINITE_STRAND	>= 25.0	set bit score of weaker strand hit for alert to `<x>`
`--lowsim5seq <n>`	lowsim5s	LOW_SIMILARITY_START	>= 15	set length (nt) threshold for alert to `<n>`
`--lowsim3seq <n>`	lowsim3s	LOW_SIMILARITY_END	>= 15	set length (nt) threshold for alert to `<n>`
`--lowsimiseq <n>`	lowsimis	LOW_SIMILARITY	>= 1	set length (nt) threshold for alert to `<n>`
`--lowsim5ftr <n>`	lowsim5f	LOW_FEATURE_SIMILARITY_START	>= 5	set length (nt) threshold for alert to `<n>`
`--lowsim3ftr <n>`	lowsim3f	LOW_FEATURE_SIMILARITY_END	>= 5	set length (nt) threshold for alert to `<n>`
`--lowsimiftr <n>`	lowsimif	LOW_FEATURE_SIMILARITY	>= 1	set length (nt) threshold for alert to `<n>`
`--biasfrac <x>`	biasdseq	BIASED_SEQUENCE	>= 0.25	set fractional bit score threshold for biased score/total score for alert to `<x>`
`--nmiscftrthr <n>`	nmiscftr	TOO_MANY_MISC_FEATURES	>= 4	set minimum number of misc_features per sequence for alert to `<n>`
`--indefann <x>`	indf5lcc, indf5lcn, indf3lcc, indf3lcn	INDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END	< 0.8	set posterior probability threshold for non-mat_peptide features for alert to `<x>`
`--indefann_mp <x>`	indf5lcc, indf5lcn, indf3lcc, indf3lcn	INDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END	< 0.6	set posterior probability threshold for mat_peptide features for alert to `<x>`
`--fstminntt <n>`	fsthicft, fstlocft, fstukct5	POSSIBLE_FRAMESHIFT_HIGH_CONF, POSSIBLE_FRAMESHIFT_LO_CONF, POSSIBLE_FRAMESHIFT	>= 4	set maximum allowed length of aligned region in different frame in which frame is not restored before CDS end to `<n>`
`--fstminnti <n>`	fsthicfi, fstlocfi, fstukcfi	POSSIBLE_FRAMESHIFT_HIGH_CONF, POSSIBLE_FRAMESHIFT_LO_CONF, POSSIBLE_FRAMESHIFT	>= 6	set maximum allowed length of aligned region in different frame in which frame is restored before CDS end to `<n>`
`--fsthighthr <x>`	fsthicnf	POSSIBLE_FRAMESHIFT_HIGH_CONF	>= 0.8	set average posterior probability threshold for potentially frameshifted region for high confidence alert to `<x>`
`--fstlowthr <x>`	fstlocnf	POSSIBLE_FRAMESHIFT_LOW_CONF	>= 0.0	set average posterior probability threshold for potentially frameshifted region for low confidence alert to `<x>`
`--xalntol <n>`	indf5pst, indf3pst	INDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END	> 5	set maximum allowed difference in nucleotides between predicted blastx and CM start/end without alert to `<n>` (blastx coordinates must be internal to CM coordinates)
`--xmaxins <n>`	insertnp	INSERTION_OF_NT	> 27	set maximum allowed nucleotide insertion length in blastx validation alignment without alert to `<n>`
`--xmaxdel <n>`	deletinp	DELETION_OF_NT	> 27	set maximum allowed nucleotide deletion length in blastx validation alignment without alert to `<n>`
`--nmaxins <n>`	insertnn	INSERTION_OF_NT	> 27	set maximum allowed nucleotide insertion length in CDS nt alignment without alert to `<n>`
`--nmaxdel <n>`	deletinn	DELETION_OF_NT	> 27	set maximum allowed nucleotide deletion length in CDS nt alignment without alert to `<n>`
`--xlonescore <n>`	indfantp	INDEFINITE_ANNOTATION	>= 80	set minimum blastx raw score for a lone blastx hit not supported by CM analysis for alert to `<n>`
`--hlonescore <n>`	indfantp	INDEFINITE_ANNOTATION	>= 10	set minimum hmmer bit score for a lone hmmsearch hit not supported by CM analysis for alert to `<n>`

`v-annotate.pl` options for controlling cmalign alignment stage

Several options exist for controlling the command-line options that will be passed to Infernal's cmalign program in the alignment stage. For more information on these options and how they control cmalign, see the Infernal User's Guide manual page for cmalign (section 8 of http://eddylab.org/infernal/Userguide.pdf)

.........option.........	explanation
`--mxsize <n>`	set maximum allowed cmalign DP matrix size to `<n>` Mb before triggering an unexpdivg alert, default `<n>` is `16000`
`--tau <x>`	set the initial tau (probability loss) value to `<x>` (sets the `cmalign --tau` option), default `<x>` is `0.001`
`--nofixedtau`	do not fix the tau value, allow it to increase if necessary (removes the `cmalign --fixedtau` option), default is to fix tau with `cmalign --fixedtau`
`--nosub`	use alternative alignment strategy for truncated sequences (removes the `cmalign --sub --notrunc` options), default is use sub-CM alignment strategy with `cmalign --sub --notrunc`
`--noglocal`	run in local mode instead of glocal mode (removes the `cmalign -g` option), default is to use glocal mode with `cmalign -g`
`--cmindi`	force cmalign to align one sequence at a time, mainly useful for debugging

`v-annotate.pl` options for controlling glsearch alignment stage as alternative to cmalign

The glsearch program from the FASTA package can be used as an alternative to the cmalign program. For more information on these options and how they control glsearch, see the FASTA documentation (https://fasta.bioch.virginia.edu/wrp_fasta/fasta_guide.pdf).

.........option.........	explanation
`--glsearch`	align with glsearch instead of cmalign
`--gls_match <n>`	set glsearch match score to `<n> > 0` (-r option in glsearch), default is `5'
`--gls_mismatch <n>`	set glsearch mismatch score to `<n> < 0` (-r option in glsearch), default is `-3`
`--gls_gapopen <n>`	set glsearch gap open score to `<n> < 0` (-f option in glsearch), default is `-17`
`--gls_gapextend <n>`	set glsearch gap extend score to `<n> < 0` (-g option in glsearch), default is `-4`

`v-annotate.pl` options for controlling blastx protein validation stage

Below is a list of options for controlling the blastx protein validaation stage. Several of these control command-line options that will be passed to blastx. For more information on these options and how they control blastx, see the NCBI BLAST documentation (tables C1 and C4 of https://www.ncbi.nlm.nih.gov/books/NBK279684/).

.........option.........	explanation
`--xmatrix <s>`	use the substitution matrix `<s>` (sets the `blastx -matrix <s>` option), default is to use the default `blastx` matrix
`--xdrop <n>`	set the xdrop options to `<n>` (sets the `blastx` `-xdrop_ungap <n>`, `-xdrop_gap <n>` and `-xdrop_gap_final <n>` with the same `<n>`), default is to use default `blastx` values
`--xnumali <n>`	specify that the top `<n>` alignments are output by `blastx`, mostly relevant in combination with `--xlongest` (sets the `blastx -num_alignments <n>` option), default `<n>` is 20
`--xlongest`	use the longest `blastx` alignment of those returned (controlled by `--xnumali <n>`), default is to use the highest scoring alignment

`v-annotate.pl` options for using hmmer instead of blastx for protein validation

Optionally, HMMER's hmmsearch program can be used instead of blastx for the protein validation stage. CAUTION: This feature is relatively new and untested. Several of these control command-line options that will be passed to blastx. For more information on HMMER, see the HMMER user's guide (http://eddylab.org/software/hmmer/Userguide.pdf).

......option......	explanation
`--pv_hmmer`	use hmmer instead of blastx for protein validation
`--h_max`	use the `--max` option with hmmsearch
`--h_minbit <x>`	set the minimum hmmsearch bit score threshold to `<x>`, the default `<x>` is `-10`.

`v-annotate.pl` options related to blastn-derived seeded alignment acceleration

The -s option turns on an acceleration heuristic based on a first-pass blastn alignment of each input sequence. With -s, blastn is used instead of cmscan for sequence classification, and the largest ungapped alignment region, called the 'seed', is extracted from the top hit blastn hit, and fixed for the alignment stage, such that only the sequence before and after the fixed seed is aligned with cmalign. This option was originally developed for SARS-CoV-2, for which it offers significant acceleration for many sequences which are highly similar to the SARS-CoV-2 RefSeq model.

When -s option is used, an additional output file with suffix .sda is created, with format described here.

.........option.........	explanation
`-s`	turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment
`--s_blastnws <n>`	for `-s`, set the blastn `-word_size` parameter to `<n>`, the default value for `<n>` is `7`
`--s_blastnrw <n>`	for `-s`, set the blastn `-reward` parameter to `<n>`, the default value for `<n>` is `1`
`--s_blastnpn <n>`	for `-s`, set the blastn `-penalty` parameter to `<n>`, the default value for `<n>` is `-2`
`--s_blastngo <n>`	for `-s`, set the blastn `-gapopen` parameter to `<n>`, the default value for `<n>` is `2`
`--s_blastnge <n>`	for `-s`, set the blastn `-gapextend` parameter to `<n>`, the default value for `<n>` is `1`
`--s_blastndf`	for `-s`, do not use -gapopen/-gapextend options with blastn, use default values for gap penalties
`--s_blastnsc <x>`	for `-s`, set the blastn minimum HSP score to consider to `<x>`, the default value for `<x>` is `50.0`
`--s_blastntk`	for `-s`, set blastn option `-task blastn`
`--s_blastnxd <n>`	for `-s`, set the blastn `-xdrop_gap_final` parameter to `<n>`, the default value for `<n>` is `110`
`--s_minsgmlen <n>`	for `-s`, set minimum length of ungapped region in HSP seed to `<n>`, the default value for `<n>` is `10`
`--s_allsgm`	for `-s`, keep full HSP as seed, do not enforce a minimum segment length
`--s_ungapsgm`	for `-s`, only keep max length ungapped segment of HSP, this was default behavior for vadr v1.1 to v1.3
`--s_startstop`	for `-s`, allow seed to include gaps in start/stop codons
`--s_overhang <n>`	for `-s`, set the length, in nt, of overlap between the 5' and 3' regions that are aligned with cmalign and the seed region to `<n>`, the default value for `<n>` is `100`

`v-annotate.pl` options related to replacing Ns with expected nucleotides

The -r option adds a pre-processing step to v-annotate.pl in which stretches of Ns are identified in each sequence and replaced with the expected nucleotides at the corresponding positions, when possible. This can sidestep problems with annotation that the Ns would normally cause. However, this option should be used with caution because it is based on the assumption that the missing regions match exactly to the expected nucleotide sequence that correspond to those missing regions.

Regions of Ns are identified using blastn and examining regions between hits for content of Ns. Ns in regions that satisfy the following three criteria are then replaced with the expected nucleotide at each corresponding position:

missing sequence region must be at least 5 nt (controllable with --r_minlen option)
length of missing sequence region must equal length of missing model region
missing sequence region must be >= 0.25 fraction Ns if it includes the 5' end or 3' end of the sequence, or >= 0.50 fraction Ns if it does not (controllable with --r_minfract5, --r_minfract3 and --r_minfracti options).

Additionally, as of v1.4, regions for which the length of the missing sequence region and missing model region are not identical are also potentially replaced if the following criteria are met:

length of missing model region is 10 nt or less longer than the length of missing sequence region (controllable with --r_diffmaxdel option) OR length of missing model region is 10 nt or less shorter than the length of missing sequence region (controllable with --r_diffmaxins option)
at least 1 of the nt in the missing sequence region is not an N (controllable with --r_diffminnonn option)
fraction of non-N nt in sequence region that match expected nt after "aligning" sequence region by flushing left or right with respect to differently length model region is at least 0.75 (controllable with --r_diffminfract option)
--r_diffno option is not used

When -r is used, an additional output file with suffix .rpn is created, with format described here.

...........option...........	explanation
`-r`	turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
`--r_minlen <n>`	for `-r`, set minimum length subsequence to possibly replace Ns in to `<n>`, the default value for `<n>` is `5`
`--r_minfract5 <f>`	for `-r`, set the minimum fraction of nucleotides in a subsequence at the 5' end to trigger N replacement to `<x>`, the default value for `<x>` is `0.25`
`--r_minfract3 <f>`	for `-r`, set the minimum fraction of nucleotides in a subsequence at the 3' end to trigger N replacement to `<x>`, the default value for `<x>` is `0.25`
`--r_minfracti <f>`	for `-r`, set the minimum fraction of nucleotides in an internal subsequence to trigger N replacement to `<x>`, the default value for `<x>` is `0.5`
`--r_diffno`	do not try replacement of N rich regions if sequence and model regions are of different lengths, the default is to try if criteria defined by other `--r_diff*` options are met
`--r_diffmaxdel`	maxium allowed length difference b/t sequence and model regions (when model length > sequence length) to try replacement is `<n>` nt, the default value for `<n>` is `10`
`--r_diffmaxins`	maxium allowed length difference b/t sequence and model regions (when sequence length > model length) to try replacement is `<n>` nt, the default value for `<n>` is `10`
`--r_diffminnonn`	minimum number of non-N nts in replacement region when model and sequence region are different lengths to try replacement is `<n>`, the default value for `<n>` is `1`
`--r_diffminfract`	minimum allowed fraction of non-N nts that must match expected nt from reference model in replacement region when model and sequence region are different lengths is `<f>`, the default value for `<f>` is `0.75`
`--r_fetchr`	for `-r`, fetch features to fasta files from sequences with Ns replaced, instead of original input sequences without Ns replaced
`--r_cdsmpr`	for `-r`, identify CDS- and mat_peptide-specific alerts using subsequences fetched from sequences with Ns replaced, instead of original input sequences without Ns replaced
`--r_pvorig`	for `-r`, use original input sequences without Ns replaced in protein validation stage, instead of sequences with Ns replaced
`--r_prof`	for `-r`, use slower profile methods, not blastn, to identify Ns to replaced
`--r_list`	for `-r`, only use models listed in file `<s>` for N replacement stage
`--r_only <s>`	for `-r`, only use model named `<s>` for N replacement stage
`--r_blastnws <n>`	for `-r`, set the blastn `-word_size` parameter to `<n>`, the default value for `<n>` is `7`
`--r_blastnrw <n>`	for `-r`, set the blastn `-reward` parameter to `<n>`, the default value for `<n>` is `1`
`--r_blastnpn <n>`	for `-r`, set the blastn `-penalty` parameter to `<n>`, the default value for `<n>` is `-2`
`--r_blastngo <n>`	for `-r`, set the blastn `-gapopen` parameter to `<n>`, the default value for `<n>` is `2`
`--r_blastnge <n>`	for `-r`, set the blastn `-gapextend` parameter to `<n>`, the default value for `<n>` is `1`
`--r_blastndf`	for `-r`, do not use -gapopen/-gapextend options with blastn, use default values for gap penalties
`--r_blastnsc <x>`	for `-r`, set the blastn minimum HSP score to consider to `<x>`, the default value for `<x>` is `50.0`
`--r_blastntk`	for `-r`, set blastn option `-task blastn`
`--r_blastnxd <n>`	for `-r`, set the blastn `-xdrop_gap_final` parameter to `<n>`, the default value for `<n>` is `110`

`v-annotate.pl` options related to splitting input sequence file into chunks and processing each chunk separately and potentially in parallel

The --split option specifies that v-annotate.pl should split up the input file into chunks and processing each chunk separately and then combining results at the end after all chunks have been processed. This limits total memory usage for large input sequence files as explained more here.

........option........	explanation
`--split`	split input fasta sequence file into chunks of `<n>` Kb where `<n>` is from `--nkb <n>` (300 Kb, by default) and run each chunk separately
`--cpu <n>`	with --split or --glsearch, parallelize across `<n>` CPU threads/workers (requires --split oor --glsearch)
`--sidx <n>`	start sequence indexing at `<n>` for output files, not intended to be set by user except when debugging

`v-annotate.pl` options related to parallelization on a compute farm/cluster

........option........	explanation
`-p`	run in parallel mode so that classification, and each per-model coverage determination and alignment step is split into multiple jobs and run in parallel on a cluster
`-q <s>`	read cluster information file from file `<s>` instead of from the default file `$VADRSCRIPTSDIR/vadr.qsubinfo`
`--errcheck`	consider any output to STDERR from a parallel job as an indication the job has failed, this will cause `v-annotate.pl` to exit, default is to ignore output to STDERR

`v-annotate.pl` options related to both splitting input and parallelization on compute farm

........option........	explanation
`--wait <n>`	set the total number of minutes to wait for all jobs to finish at each stage to `<n>`, if any job is not finished this many minutes after being submitted (as indicated by the existence of an expected output file) then `v-annotate.pl` will exit in error, default `<n>` is `500`
`--maxnjobs <n>`	set the maximum number of jobs at each stage to `<n>`, default `<n>` is 2500

`v-annotate.pl` options for skipping stages

......option......	explanation
`--pv_skip`	do not perform protein validation stage for CDS
`--align_skip`	skip the `cmalign` stage, use results from previous run, this is mostly useful for debugging purposes
`--val_only`	validate CM and other input files and exit

`v-annotate.pl` options for optional output files

.......option.......	explanation
`--out_stk`	create additional per-model output stockholm alignments with `.stk` suffix
`--out_afa`	create additional per-model output aligned fasta alignments with `.afa` suffix
`--out_rpstk`	with `-r`, create additional per-model output stockholm alignments with sequences with Ns replaced with `.rpstk` suffix
`--out_rpafa`	create additional per-model output aligned fasta alignments with sequences with Ns replaced with `.rpafa` suffix
`--out_fsstk`	output frameshift stockholm alignment files with `.frameshift.stk` suffix
`--out_allfasta`	output fasta files of predicted features
`--out_nofasta`	minimize total size of output; do not output fasta files of all passing and all failing sequences
`--out_debug`	create additional output files with information on various data structures

Other `v-annotate.pl` expert options

.........option.........	explanation
`--execname <s>`	in banner and usage output, replace `v-annotate.pl` with `<s>`
`--alicheck`	for debugging purposes, check aligned sequence versus input sequence for identity
`--noseqnamemax`	do not enforce the GenBank maximum length of 50 characters for sequence names
`--minbit <x>`	set minimum cmsearch/cmscan bit score threshold to `<x>`, the default value for `<x>` is `-10`
`--origfa`	do not copy the input fasta file into output directory prior to analysis, use the original
`--msub <s>`	specify that file `<s>` lists models to substitute, each line should contain two space-delimited tokens, model listed in token 2 will substitute as best-matching model for all sequences classified as the model listed in token 1
`--xsub <s>`	specify that file `<s>` lists blastx dbs to substitute, each line should contain two space-delimited tokens, blastx db for model listed in token 2 will substitute as blastx db for all sequences classified as the model listed in token 1
`--nodcr`	never doctor alignments to shift gaps to correct start/stop codon annotation
`--forcedcrins`	force insert type alignment doctoring, requires `--cmindi`, mainly useful for debugging/testing
`--xnoid`	ignore blastx hits that are full length and 100% identical, mainly useful for testing

Information on `v-annotate.pl` alerts

To see a table with information on alerts, use the --alt_list option, like this:

v-annotate.pl --alt_list

The table below contains the same information as in the --alt_list output, with sequences organized according to whether they are fatal or not. Always fatal alert codes are always fatal and cannot be changed using the --alt_pass options. All other alert codes can be changed from fatal to non-fatal by using the --alt_pass option, or from non-fatal to fatal using the --alt_fail option. An example is included below.

In the table below, the type column reports if each alert pertains to an entire sequence or a specific annotated feature within a sequence. The causes misc_feature, not failure (if in modelinfo file shows which alerts are not fatal for expendable features as described more below.

Description of always fatal alert codes

alert code	type	causes `misc_feature`, not failure (if in modelinfo file)	short description/error name	long description
noannotn	sequence	never	NO_ANNOTATION	no significant similarity detected
revcompl	sequence	never	REVCOMPLEM	sequence appears to be reverse complemented
unexdivg	sequence	never	UNEXPECTED_DIVERGENCE	sequence is too divergent to confidently assign nucleotide-based annotation
noftrann	sequence	never	NO_FEATURES_ANNOTATED	sequence similarity to homology model does not overlap with any features
noftrant	sequence	never	NO_FEATURES_ANNOTATED	all annotated features are too short to be output to feature table
ftskipfl	sequence	never	UNREPORTED_FEATURE_PROBLEM	only fatal alerts are for feature(s) not output to feature table

Description of alerts that are fatal by default

alert code	type	causes `misc_feature`, not failure (if in modelinfo file)	short description/error name	long description
incsbgrp	sequence	never	INCORRECT_SPECIFIED_SUBGROUP	score difference too large between best overall model and best specified subgroup model
incgroup	sequence	never	INCORRECT_SPECIFIED_GROUP	score difference too large between best overall model and best specified group model
lowcovrg	sequence	never	LOW_COVERAGE	low sequence fraction with significant similarity to homology model
dupregin	sequence	never	DUPLICATE_REGIONS	similarity to a model region occurs more than once
discontn	sequence	never	DISCONTINUOUS_SIMILARITY	not all hits are in the same order in the sequence and the homology model
indfstrn	sequence	never	INDEFINITE_STRAND	significant similarity detected on both strands
lowsim5s	sequence	never	LOW_SIMILARITY_START	significant similarity not detected at 5' end of the sequence
lowsim3s	sequence	never	LOW_SIMILARITY_END	significant similarity not detected at 3' end of the sequence
lowsimis	sequence	never	LOW_SIMILARITY	internal region without significant similarity
nmiscftr	sequence	never	TOO_MANY_MISC_FEATURES	too many features reported as misc_features
deletins	sequence	never	DELETION_OF_FEATURE	internal deletion of a complete feature
mutstart	feature	yes	MUTATION_AT_START	expected start codon could not be identified
mutendcd	feature	yes	MUTATION_AT_END	expected stop codon could not be identified, predicted CDS stop by homology is invalid
mutendns	feature	yes	MUTATION_AT_END	expected stop codon could not be identified, no in-frame stop codon exists 3' of predicted valid start codon
mutendex	feature	yes	MUTATION_AT_END	expected stop codon could not be identified, first in-frame stop codon exists 3' of predicted stop position
unexleng	feature	yes	UNEXPECTED_LENGTH	length of complete coding (CDS or mat_peptide) feature is not a multiple of 3
cdsstopn	feature	yes	CDS_HAS_STOP_CODON	in-frame stop codon exists 5' of stop position predicted by homology to reference
cdsstopp	feature	yes	CDS_HAS_STOP_CODON	stop codon in protein-based alignment
fsthicft	feature	yes	POSSIBLE_FRAMESHIFT_HIGH_CONF	high confidence possible frameshift in CDS (frame not restored before end) (not reported if `--glsearch`
fsthicfi	feature	yes	POSSIBLE_FRAMESHIFT_HIGH_CONF	high confidence possible frameshift in CDS (frame restored before end) (not reported if `--glsearch`)
fstukcf3	feature	yes	POSSIBLE_FRAMESHIFT	possible frameshift in CDS (frame not restored before end) (only reported if `--glsearch`)
fstukcfi	feature	yes	POSSIBLE_FRAMESHIFT	possible frameshift in CDS (frame restored before end) (only reported if `--glsearch`)
peptrans	feature	yes	PEPTIDE_TRANSLATION_PROBLEM	mat_peptide may not be translated because its parent CDS has a problem
pepadjcy	feature	yes	PEPTIDE_ADJACENCY_PROBLEM	predictions of two mat_peptides expected to be adjacent are not adjacent
indfantp	feature	no	INDEFINITE_ANNOTATION	protein-based search identifies CDS not identified in nucleotide-based search
indfantn	feature	no	INDEFINITE_ANNOTATION	nucleotide-based search identifies CDS not identified in protein-based search
indf5gap	feature	yes	INDEFINITE_ANNOTATION_START	alignment to homology model is a gap at 5' boundary
indf5lcn	feature	yes	INDEFINITE_ANNOTATION_START	alignment to homology model has low confidence at 5' boundary for feature that does not match a CDS
indf5plg	feature	yes	INDEFINITE_ANNOTATION_START	protein-based alignment extends past nucleotide-based alignment at 5' end
indf5pst	feature	yes	INDEFINITE_ANNOTATION_START	protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
indf3gap	feature	yes	INDEFINITE_ANNOTATION_END	alignment to homology model is a gap at 3' boundary
indf3lcn	feature	yes	INDEFINITE_ANNOTATION_END	alignment to homology model has low confidence at 3' boundary for feature that does not match a CDS
indf3plg	feature	yes	INDEFINITE_ANNOTATION_END	protein-based alignment extends past nucleotide-based alignment at 3' end
indf3pst	feature	yes	INDEFINITE_ANNOTATION_END	protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
indfstrp	feature	no	INDEFINITE_STRAND	strand mismatch between protein-based and nucleotide-based predictions
insertnp	feature	no	INSERTION_OF_NT	too large of an insertion in protein-based alignment
deletinp	feature	yes	DELETION_OF_NT	too large of a deletion in protein-based alignment
deletinf	feature	no	DELETION_OF_FEATURE_SECTION	internal deletion of a complete section in a multi-section feature with other section(s) annotated
lowsim5n	feature	no	LOW_FEATURE_SIMILARITY_START	region within annotated feature that does not match a CDS at 5' end of sequence lacks significant similarity
lowsim3n	feature	no	LOW_FEATURE_SIMILARITY_END	region within annotated feature that does not match a CDS at 3' end of sequence lacks significant similarity
lowsimin	feature	no	LOW_FEATURE_SIMILARITY	region within annotated feature that does not match a CDS lacks significant similarity

Description of alerts that are non-fatal by default

alert code	type	causes `misc_feature`, not failure (if in modelinfo file)	short description/error name	long description
qstsbgrp	sequence	never	QUESTIONABLE_SPECIFIED_SUBGROUP	best overall model is not from specified subgroup
qstgroup	sequence	never	QUESTIONABLE_SPECIFIED_GROUP	best overall model is not from specified group
ambgnt5s	sequence	never	AMBIGUITY_AT_START	first nucleotide of the sequence is an ambiguous nucleotide
ambgnt3s	sequence	never	AMBIGUITY_AT_END	final nucleotide of the sequence is an ambiguous nucleotide
indfclas	sequence	never	INDEFINITE_CLASSIFICATION	low score difference between best overall model and second best model (not in best model's subgroup)
lowscore	sequence	never	LOW_SCORE	score to homology model below low threshold
biasdseq	sequence	never	BIASED_SEQUENCE	high fraction of score attributed to biased sequence composition
unjoinbl	sequence	never	UNJOINABLE_SUBSEQ_ALIGNMENTS	inconsistent alignment of overlapping region between ungapped seed and flanking region
deletina	sequence	never	DELETION_OF_FEATURE	allowed internal deletion of a complete feature (feature with `is_deletable` flag set to `1` in `.minfo` file)
ambgntrp	sequence	never	N_RICH_REGION_NOT_REPLACED	N-rich region of unexpected length not replaced during N replacement region (only possibly reported if `-r`)
fstlocft	feature	yes	POSSIBLE_FRAMESHIFT_LOW_CONF	low confidence possible frameshift in CDS (frame not restored before end) (not reported if `--glsearch`)
fstlocfi	feature	yes	POSSIBLE_FRAMESHIFT_LOW_CONF	low confidence possible frameshift in CDS (frame restored before end) (not reported if `--glsearch`)
indf5lcc	feature	yes	INDEFINITE_ANNOTATION_START	alignment to homology model has low confidence at 5' boundary for feature that is or matches a CDS
indf3lcc	feature	yes	INDEFINITE_ANNOTATION_END	alignment to homology model has low confidence at 3' boundary for feature that is or matches a CDS
insertnn	feature	no	INSERTION_OF_NT	too large of an insertion in nucleotide-based alignment of CDS feature
deletinn	feature	yes	DELETION_OF_NT	too large of a deletion in nucleotide-based alignment of CDS feature
lowsim5c	feature	no	LOW_FEATURE_SIMILARITY_START	region within annotated feature that is or matches a CDS at 5' end of sequence lacks significant similarity
lowsim3c	feature	no	LOW_FEATURE_SIMILARITY_END	region within annotated feature that is or matches a CDS at 3' end of sequence lacks significant similarity
lowsimic	feature	no	LOW_FEATURE_SIMILARITY	region within annotated feature that is or matches a CDS lacks significant similarity
ambgnt5f	feature	no	AMBIGUITY_AT_FEATURE_START	first nucleotide of non-CDS feature is an ambiguous nucleotide
ambgnt3f	feature	no	AMBIGUITY_AT_FEATURE_END	final nucleotide of non-CDS feature is an ambiguous nucleotide
ambgnt5c	feature	no	AMBIGUITY_AT_CDS_START	first nucleotide of CDS is an ambiguous nucleotide
ambgnt3c	feature	no	AMBIGUITY_AT_CDS_END	final nucleotide of CDS is an ambiguous nucleotide
ambgcd5c	feature	no	AMBIGUITY_IN_START_CODON	5' complete CDS starts with canonical nt but includes ambiguous nt in its start codon
ambgcd3c	feature	no	AMBIGUITY_IN_STOP_CODON	3' complete CDS ends with canonical nt but includes ambiguous nt in its stop codon

Additional information on `v-annotate.pl` alerts

The table below has additional information on the alerts not contained in the --alt_list output. The "relevant_options" column lists command-line options that pertain to each alert. The [relevant feature types column shows which feature types each alert can be reported for (this field is "-" for alerts that pertain to a sequence instead of a feature). The omitted in .tbl and .alt.list by column lists other alerts that, if present, will cause this alert to be omitted in the .tbl and .alt.list files to reduce redundant information reported to user, this is "-" for alerts that are never omitted from those files.

More information on always fatal alert codes

alert code	short description/error name	relevant options	relevant feature types	omitted in `.tbl` and `.alt.list` by
noannotn	NO_ANNOTATION	none	-	-
revcompl	REVCOMPLEM	none	-	-
unexdivg	UNEXPECTED_DIVERGENCE	none	-	-
noftrann	NO_FEATURES_ANNOTATED	none	-	-
noftrant	NO_FEATURES_ANNOTATED	none	-	-
ftskipfl	UNREPORTED_FEATURE_PROBLEM	none	-	-

More information on alerts that are fatal by default

alert code	short description/error name	relevant_options	relevant feature types	omitted in `.tbl` and `.alt.list` by
incsbgrp	INCORRECT_SPECIFIED_SUBGROUP	`--incspec`	-	-
incgroup	INCORRECT_SPECIFIED_GROUP	`--incspec`	-	-
lowcovrg	LOW_COVERAGE	`--lowcov`	-	-
dupregin	DUPLICATE_REGIONS	`--dupreg`	-	-
discontn	DISCONTINUOUS_SIMILARITY	none	-	-
indfstrn	INDEFINITE_STRAND	`--indefstr`	-	-
lowsim5s	LOW_SIMILARITY_START	`--lowsim5seq`	-	-
lowsim3s	LOW_SIMILARITY_END	`--lowsim3seq`	-	-
lowsimis	LOW_SIMILARITY	`--lowsimint`	-	-
nmiscftr	TOO_MANY_MISC_FEATURES	`--nmiscftrthr`	all	-
deletins	DELETION_OF_FEATURE	none	all	-
mutstart	MUTATION_AT_START	`--atgonly`	CDS	-
mutendcd	MUTATION_AT_END	none	CDS	cdsstopn, mutendex, mutendns
mutendns	MUTATION_AT_END	none	CDS	-
mutendex	MUTATION_AT_END	none	CDS	-
unexleng	UNEXPECTED_LENGTH	none	CDS, mat_peptide	-
cdsstopn	CDS_HAS_STOP_CODON	none	CDS	-
cdsstopp	CDS_HAS_STOP_CODON	none	CDS	-
fsthicft	POSSIBLE_FRAMESHIFT_HIGH_CONF	`--fsthighthr`, `--fstminntt`	CDS	-
fsthicfi	POSSIBLE_FRAMESHIFT_HIGH_CONF	`--fsthighthr`, `--fstminnti`	CDS	-
fstukcft	POSSIBLE_FRAMESHIFT	`--glsearch`, `--fstminntt`	CDS	-
fstukcfi	POSSIBLE_FRAMESHIFT	`--glsearch`, `--fstminnti`	CDS	-
peptrans	PEPTIDE_TRANSLATION_PROBLEM	none	mat_peptide	-
pepadjcy	PEPTIDE_ADJACENCY_PROBLEM	none	mat_peptide	-
indfantp	INDEFINITE_ANNOTATION	`--xlonescore`	CDS	-
indfantn	INDEFINITE_ANNOTATION	none	CDS	-
indf5gap	INDEFINITE_ANNOTATION_START	none	all	-
indf5lcn	INDEFINITE_ANNOTATION_START	`--indefann`, `--indefann_mp`	all except CDS and any gene or mat_peptide with identical start coordinate to a CDS	-
indf5plg	INDEFINITE_ANNOTATION_START	none	CDS	-
indf5pst	INDEFINITE_ANNOTATION_START	`--xalntol`	CDS	-
indf3gap	INDEFINITE_ANNOTATION_END	none	all	-
indf3lcn	INDEFINITE_ANNOTATION_END	`--indefann`, `--indefann_mp`	all except CDS and any gene with identical stop coordinate to CDS	-
indf3plg	INDEFINITE_ANNOTATION_END	none	CDS	-
indf3pst	INDEFINITE_ANNOTATION_END	`--xalntol`	CDS	-
indfstrp	INDEFINITE_STRAND	none	CDS	-
insertnp	INSERTION_OF_NT	`--xmaxins`	CDS	-
insertnn	INSERTION_OF_NT	`--nmaxins`	CDS	-
deletinp	DELETION_OF_NT	`--xmaxdel`	CDS	-
deletinf	DELETION_OF_FEATURE_SECTION	none	all	-
lowsim5n	LOW_FEATURE_SIMILARITY_START	`--lowsim5ftr`	all except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide	-
lowsim3n	LOW_FEATURE_SIMILARITY_END	`--lowsim3ftr`	all except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide	-
lowsimin	LOW_FEATURE_SIMILARITY	`--lowsimiftr`	all except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide	-

More information on alerts that are non-fatal by default

alert code	short description/error name	relevant_options	relevant feature types	omitted in `.tbl` and `.alt.list` by
qstsbgrp	QUESTIONABLE_SPECIFIED_SUBGROUP	none	-	-
qstgroup	QUESTIONABLE_SPECIFIED_GROUP	none	-	-
ambgnt5s	AMBIGUITY_AT_START	none	-	-
ambgnt3s	AMBIGUITY_AT_END	none	-	-
indfclas	INDEFINITE_CLASSIFICATION	`--indefclas`	-	-
lowscore	LOW_SCORE	`--lowsc`	-	-
biasdseq	BIASED_SEQUENCE	`--biasfrac`	-	-
unjoinbl	UNJOINABLE_SUBSEQ_ALIGNMENTS	none	-
deletina	DELETION_OF_FEATURE	`--ignore_isdel`	all	-
ambgntrp	N_RICH_DELETION_OF_FEATURE	`--r_diffno`, `--r_diffmaxdel`, `--r_diffmaxins`, `--r_diffminnonn`, `--r_diffminfract`	all	-
fstlocft	POSSIBLE_FRAMESHIFT_LOW_CONF	`--fstlothr`, `--fstminntt`	CDS	-
fstlocfi	POSSIBLE_FRAMESHIFT_LOW_CONF	`--fstlothr`, `--fstminnti`	CDS	-
indf5lcc	INDEFINITE_ANNOTATION_START	`--indefann`, `--indefann_mp`	CDS and any gene or mat_peptide with identical start coordinate to a CDS	-
indf3lcc	INDEFINITE_ANNOTATION_END	`--indefann`, `--indefann_mp`	CDS and any gene with identical stop coordinate to CDS	-
insertnn	INSERTION_OF_NT	`--nmaxins`	CDS	-
deletinn	DELETION_OF_NT	`--nmaxdel`	CDS	-
lowsim5c	LOW_FEATURE_SIMILARITY_START	`--lowsim5ftr`	CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide	-
lowsim3c	LOW_FEATURE_SIMILARITY_END	`--lowsim3ftr`	CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide	-
lowsimic	LOW_FEATURE_SIMILARITY	`--lowsimiftr`	CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide	-
ambgnt5f	AMBIGUITY_AT_FEATURE_START	none	-	-
ambgnt3f	AMBIGUITY_AT_FEATURE_END	none	-	-
ambgnt5c	AMBIGUITY_AT_CDS_START	none	CDS	-
ambgnt3c	AMBIGUITY_AT_CDS_END	none	CDS	-
ambgcd5c	AMBIGUITY_IN_START_CODON	none	CDS	-
ambgcd3c	AMBIGUITY_IN_STOP_CODON	none	CDS	-

Expendable features: allowing sequences to pass despite fatal alerts for specific features

It is possible to specify that certain features are expendable and so have relaxed requirements. Some alerts that are normally fatal are not fatal for expendable features. If any such alerts are reported for an expendable feature that feature will be turned into a misc_feature in the output feature table .pass.tbl file, but the sequence will still pass, as long as it has zero fatal alerts for all non-expendable features and zero fatal sequence alerts.

The default set of specific alerts that an expendable feature can have without failing its sequence are listed with 'yes' in the 'causes misc_feature, not failure (if in modelinfo file)' column in the tables describing alerts above as well as in the --alt_list output. This set can be changed using the --alt_mnf_yes <s1> option to specify that alert codes in the comma-separated string <s1> be added to the set, and the --alt_mnf_no <s2> option to specify that alert codes in the comma-separated string <s2> be removed from the set.

Expendable features are specified in the .modelinfo file, with a key/value pair string: misc_not_feature:"1" in the FEATURE line for the corresponding feature.

For example, the sequence JN975492.1 is the one sequence in the example above that fails. It matches best to the NC_008311 model. It fails due to the fatal alerts mutendcd, cdsstopn, cdsstopp, and indf3pst for the VF1 CDS feature, and indf5pst fatal alert for the VP2 CDS as shown above. If the VF1 and VP2 features were defined as expendable using the misc_not_failure:"1" key/value pair in the .minfo file as they are in the included example file vadr.mnf-example.minfo, then the sequence would have passed.

The relevant excerpt from the $VADRSCRIPTSDIR/documentation/annotate-files/vadr.mnf-example.minfo file:

FEATURE NC_008311 type:"gene" coords:"5069..5710:+" parent_idx_str:"GBNULL" gene:"ORF4" misc_not_failure:"1"
FEATURE NC_008311 type:"CDS" coords:"5069..5710:+" parent_idx_str:"GBNULL" gene:"ORF4" product:"VF1" misc_not_failure:"1"
FEATURE NC_008311 type:"gene" coords:"6681..7307:+" parent_idx_str:"GBNULL" gene:"ORF3" misc_not_failure:"1"
FEATURE NC_008311 type:"CDS" coords:"6681..7307:+" parent_idx_str:"GBNULL" gene:"ORF3" product:"VP2" misc_not_failure:"1"

Note that in addition to the two CDS features, the two gene features that correspond them also have misc_not_failure:"1" key/value pairs. When a CDS is made expendable, it often makes sense to make any corresponding gene features expendable too. However, gene features are an exception in that they do not get turned into a misc_feature if they have alerts that are normally fatal, as per GenBank convention, but it is still relevant to mark them as expendable because some alerts in them will not cause the sequence to fail.

To rerun the example using this new .minfo file, execute:

v-annotate.pl -i $VADRSCRIPTSDIR/documentation/annotate-files/vadr.mnf-example.minfo $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-mnf-noro.9

The output will indicate that all sequences now pass:

# Summary of classified sequences:
#
#                                      num   num   num
#idx  model      group      subgroup  seqs  pass  fail
#---  ---------  ---------  --------  ----  ----  ----
1     NC_008311  Norovirus  GV           2     2     0
2     NC_029645  Norovirus  GIII         2     2     0
3     NC_039477  Norovirus  GII          2     2     0
4     NC_044854  Norovirus  GI           2     2     0
5     NC_001959  Norovirus  GI           1     1     0
#---  ---------  ---------  --------  ----  ----  ----
-     *all*      -          -            9     9     0
-     *none*     -          -            0     0     0
#---  ---------  ---------  --------  ----  ----  ----

But the output .pass.tbl will not include the VF1 and VP2 CDS features for JN975492.1, instead it will include misc_feature features with note qualifiers that indicate the regions are similar to their respective CDS:

Relevant excerpt from va-mnf-noro.9.vadr.pass.tbl:

5044	5685	gene
			gene	ORF4
5044	5685	misc_feature
			note	similar to VF1
6656	7282	gene
			gene	ORF3
6656	7282	misc_feature
			note	similar to VP2

Note that if only the VF1 CDS or VP2 CDS feature lines included the misc_not_failure:"1" key/value pairs in the modelinfo file, the sequence would have failed.

Two important caveats above expendable features and misc_feature-ization:

As mentioned above, features with type gene, 5'UTR, 3'UTR or operon are never converted to misc_feature values as per GenBank convention.
misc_feature-ization occurs in .pass.tbl output files for expendable features as explained above even when the option --nomisc is used. (The --nomisc option causes misc_features not to be reported in .fail.tbl files.)

Limiting memory usage and parallelization with multi-threading

The v-annotate.pl script, in particular the alignment step, is memory intensive. For Norovirus and Dengue virus, it is recommended to have 16G of RAM available. For larger viruses, such as the roughly 30Kb SARS-CoV-2 virus, 64G of available RAM is recommended. However, the --glsearch and --split options can be used to reduce the memory requirements.

The --glsearch option causes the glsearch program from the FASTA software package to be used instead of Infernal's memory intensive cmalign program. However, --glsearch has only been extensively tested for SARS-CoV-2 sequences, for which it is now recommended due to the high 64G memory recommendation with cmalign.

With --glsearch the amount of required memory is roughly 2G of RAM for small input fasta files with 2000 sequences or less, but can exceed 2G for very large input files. Required memory will increase with the size of the input file.

Using the --split option removes the dependence of required memory on input file size as it causes splitting of the input fasta file into independent chunks with each chunk processed separately and results from all chunks combined at the end.

Also, in combination with the --glsearch and --split options, the user can specify multi-threading with <n> CPUs by using the --cpu <n> option. It is recommended that at least 2G * <n> total RAM is available when using this option.

In summary, the following combination of options are recommended to reduce memory usage and speed-up processing for SARS-CoV-2 annotation, provided you are running on a machine with 8 available cores and 16G of total RAM: --glsearch --split --cpu 8.

For more information on SARS-CoV-2 annotation with VADR see https://github.com/ncbi/vadr/wiki/Coronavirus-annotation

Alternative parallelization using a cluster

Alternatively, if you have access to a cluster and want to parallelize but do not want to use glsearch, you can use the -p option. Importantly, the -p option will not work with --glsearch and will not reduce the memory requirements like --glsearch does.

Using -p will parallelize the most time-consuming stages of v-annotate.pl (classification, coverage determination and alignment) on a cluster by splitting up the input sequence file randomly into multiple files, and running each as a separate job. This is most beneficial for large input sequence files.

With -p, by default, v-annotate.pl will consult the file $VADRSCRIPTSDIR/vadr.qsubinfo to read the command prefix and suffix for submitting jobs to the cluster. This file is set up to use Univa Grid Engine (UGE 8.5.5) and specific flags used on the NCBI system, but you can either modify this file to work with your own cluster or create a new file <s> and use the option -q <s> to read that file. The $VADRSCRIPTSDIR/vadr.qsubinfo has comments at the top that explain the format of the file. Email eric.nawrocki@nih.gov for help.

To repeat the above v-annotate.pl run in the example usage section, use this command:

v-annotate.pl -p $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-parallel-noro.9

Usage of -p will not affect the output of v-annotate.pl other than these lines about the status of jobs, but it can make processing of large sequence files significantly faster depending on how busy the cluster is.

Files

annotate.md

Latest commit

History

annotate.md

File metadata and controls

v-annotate.pl example usage, command-line options and alert information