v-annotate.pl
example usagev-annotate.pl
command-line options- basic options
- options for specifying expected sequence classification
- options for controlling which alerts are fatal
- options related to model files
- options for controlling output feature table
- options for controlling alert thresholds
- options for controlling the alignment stage
- options for controlling the blastx protein validation stage
- options for using hmmer instead of blastx for protein validation
- options related to blastn-based seeded alignment acceleration strategy
- options related to pre-processing to replace Ns with expected nucleotides
- options related to splitting input fasta file and multithreading
- options related to parallelization on a compute farm/cluster
- options related to both splitting input and parallelization on a compute farm/cluster
- options for skipping stages
- options for additional output files
- additional expert options
- Basic Information on
v-annotate.pl
alerts - Additional information on
v-annotate.pl
alerts - Expendable features: allowing sequences to pass despite fatal alerts for specific features
- Limiting memory usage and multi-threading
- Alternative parallelization using a cluster
v-annotate.pl
uses previously created VADR models from v-build.pl
to analyze and annotate sequences in an input sequence file. As part
of the analysis of the sequences, more than 40 types of unexpected
characteristics, or alerts are detected and reported in the
output. Most of the alerts are fatal in that if a sequence has one
or more fatal alerts, they will be designated as failing
sequences. Sequences with zero fatal alerts are designated as
passing sequences. The types of alerts are described further below.
NOTE: the examples below are for norovirus and demonstrate the typical usage of vadr. For examples specific to SARS-CoV-2 see:
https://github.com/ncbi/vadr/wiki/Coronavirus-annotation
To determine the command-line usage of
v-annotate.pl
(or any VADR script), use the -h
option, like this:
v-annotate.pl -h
You'll see something like the following output:
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.4 (Dec 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Thu Dec 16 09:38:22 2021
#
Usage: v-annotate.pl [-options] <fasta file to annotate> <output directory to create>
The first few lines are the banner which show the name of the VADR
script being run along with the version and release date. This is
followed by the time and date the command was executed. The Usage:
line details the expected command line arguments. v-annotate.pl
takes
as input two command line arguments, a fasta file with sequences to
analyze and annotate (<fasta file to annotate>
) and and the name of the output directory you
want it to create (<output directory to create>
) and
populate with output files.
After that comes a list of all available command-line options. These are explained in more detail below.
Here is an example v-annotate.pl
command
using the fasta file
vadr/documentation/annotate-files/noro.9.fa
with 9 norovirus sequences, and creating an output directory with
va-noro.9
:
v-annotate.pl $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-noro.9
The standard output of v-annotate.pl
that is printed to the screen
(which is also output to the .log
output file) begins with the
banner and date again followed by a list of relevant environment
variables, the command line arguments used and any command line
options used:
# date: Thu Dec 16 09:38:53 2021
# $VADRBIOEASELDIR: /home/nawrocki/vadr-install-dir/Bio-Easel
# $VADRBLASTDIR: /home/nawrocki/vadr-install-dir/ncbi-blast/bin
# $VADREASELDIR: /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRINFERNALDIR: /home/nawrocki/vadr-install-dir/infernal/binaries
# $VADRMODELDIR: /home/nawrocki/vadr-install-dir/vadr-models-calici
# $VADRSCRIPTSDIR: /home/nawrocki/vadr-install-dir/vadr
#
# sequence file: /home/nawrocki/vadr-install-dir/vadr/documentation/annotate-files/noro.9.fa
# output directory: va-noro.9
No command line options were used in our example output, but if they
were information on them would have appeared after the output directory
line.
Next, information is output about each step the script is proceeding through. When each step is completed, the elapsed time for that step is output.
v-annotate.pl
will use the default VADR model library (CM file
$VADRMODELDIR/vadr.cm
, model info file $VADRMODELDIR/vadr.minfo
and BLAST DBs in $VADRMODELDIR) to analyze the sequences in
noro.9.fa
, and will create an output directory named va-noro.9
and
populate it with many output files.
There are four main stages to v-annotate.pl
-
Classification: each sequence
S
in the input file is compared against all models in the model library to determine the best-matching (highest-scoring) modelM(S)
for each sequence. -
Coverage determination: each sequence
S
is compared again toM(S)
to determine the coverage ofS
, which is basically the fraction of nucleotides inS
that appear to be homologous (sufficiently similar) toM(S)
. -
Alignment and feature mapping: each sequence
S
is aligned toM(S)
and based on that alignment, features ofM(S)
are mapped ontoS
. -
Protein validation of CDS features: for each sequence
S
that has 1 or more predicted CDS features,blastx
orhmmsearch
is used to compare the predicted CDS and the full sequenceS
against the VADR library BLAST or HMM DB.
The output of v-annotate.pl
lists one or more steps per stage. The
first two steps are:
# Validating input ... done. [ 0.7 seconds]
# Classifying sequences (9 seqs) ... done. [ 31.4 seconds]
The first step validates that the VADR library .minfo
file being
used corresponds to the VADR library .cm
file. Next, the
classification stage is performed. After that, each model that is
M(S)
for at least one S
is run separately through the coverage
determination stage for all of its sequences:
# Determining sequence coverage (NC_001959: 1 seq) ... done. [ 1.0 seconds]
# Determining sequence coverage (NC_008311: 2 seqs) ... done. [ 2.9 seconds]
# Determining sequence coverage (NC_029645: 2 seqs) ... done. [ 1.1 seconds]
# Determining sequence coverage (NC_039477: 2 seqs) ... done. [ 3.0 seconds]
# Determining sequence coverage (NC_044854: 2 seqs) ... done. [ 0.9 seconds]
Next, the alignments are performed for each model, and used to map feature annotation:
# Aligning sequences (NC_001959: 1 seq) ... done. [ 0.8 seconds]
# Aligning sequences (NC_008311: 2 seqs) ... done. [ 10.9 seconds]
# Aligning sequences (NC_029645: 2 seqs) ... done. [ 1.5 seconds]
# Aligning sequences (NC_039477: 2 seqs) ... done. [ 11.6 seconds]
# Aligning sequences (NC_044854: 2 seqs) ... done. [ 0.9 seconds]
# Determining annotation ... done. [ 0.6 seconds]
The classification and alignment stages are typically the slowest. The protein validation stage is usually relatively fast:
# Validating proteins with blastx (NC_001959: 1 seq) ... done. [ 2.0 seconds]
# Validating proteins with blastx (NC_008311: 2 seqs) ... done. [ 1.1 seconds]
# Validating proteins with blastx (NC_029645: 2 seqs) ... done. [ 0.8 seconds]
# Validating proteins with blastx (NC_039477: 2 seqs) ... done. [ 1.0 seconds]
# Validating proteins with blastx (NC_044854: 2 seqs) ... done. [ 0.7 seconds]
The only remaining steps are to create the output files:
# Generating feature table output ... done. [ 0.0 seconds]
# Generating tabular output ... done. [ 0.0 seconds]
After the output files are generated, a summary of the results is output, starting with a list of all models that were the best matching model to one or more sequences:
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- --------- --------- -------- ---- ---- ----
1 NC_008311 Norovirus GV 2 1 1
2 NC_029645 Norovirus GIII 2 2 0
3 NC_039477 Norovirus GII 2 2 0
4 NC_044854 Norovirus GI 2 2 0
5 NC_001959 Norovirus GI 1 1 0
#--- --------- --------- -------- ---- ---- ----
- *all* - - 9 8 1
- *none* - - 0 0 0
#--- --------- --------- -------- ---- ---- ----
The above table is also saved to a file with the suffix .mdl
,
and the format is described in more detail here.
The nine input sequences collectively had five different best-matching
models, all Norovirus models with the subgroups shown above. (The
groups and subgroups shown above were input as command-line
options to v-build.pl
when the models
were built. An example is here. Eight of
the nine sequences passed and one failed.
Next, the alerts reported for the sequences are summarized:
# Summary of reported alerts:
#
# alert causes short per num num long
#idx code failure description type cases seqs description
#--- -------- ------- --------------------------- ------- ----- ---- -----------
1 mutendcd yes MUTATION_AT_END feature 1 1 expected stop codon could not be identified, predicted CDS stop by homology is invalid
2 cdsstopn yes CDS_HAS_STOP_CODON feature 1 1 in-frame stop codon exists 5' of stop position predicted by homology to reference
3 cdsstopp yes CDS_HAS_STOP_CODON feature 1 1 stop codon in protein-based alignment
4 indf5pst yes INDEFINITE_ANNOTATION_START feature 1 1 protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
5 indf3pst yes INDEFINITE_ANNOTATION_END feature 1 1 protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#--- -------- ------- --------------------------- ------- ----- ---- -----------
For the nine sequences, there were five different alert codes reported, each with one case for a single sequence. This tabular summary does not indicate to which sequence(s) each alert pertains. We can find that out from other output files as discussed further below.
Next, the list of output files created by v-annotate.pl
is
printed, along with elapsed time:
# Output printed to screen saved in: va-noro.9.vadr.log
# List of executed commands saved in: va-noro.9.vadr.cmd
# List and description of all output files saved in: va-noro.9.vadr.filelist
# esl-seqstat -a output for input fasta file saved in: va-noro.9.vadr.seqstat
# 5 column feature table output for passing sequences saved in: va-noro.9.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in: va-noro.9.vadr.fail.tbl
# list of passing sequences saved in: va-noro.9.vadr.pass.list
# list of failing sequences saved in: va-noro.9.vadr.fail.list
# list of alerts in the feature tables saved in: va-noro.9.vadr.alt.list
# fasta file with passing sequences saved in: va-noro.9.vadr.pass.fa
# fasta file with failing sequences saved in: va-noro.9.vadr.fail.fa
# per-sequence tabular annotation summary file saved in: va-noro.9.vadr.sqa
# per-sequence tabular classification summary file saved in: va-noro.9.vadr.sqc
# per-feature tabular summary file saved in: va-noro.9.vadr.ftr
# per-model-segment tabular summary file saved in: va-noro.9.vadr.sgm
# per-model tabular summary file saved in: va-noro.9.vadr.mdl
# per-alert tabular summary file saved in: va-noro.9.vadr.alt
# alert count tabular summary file saved in: va-noro.9.vadr.alc
# alignment doctoring tabular summary file saved in: va-noro.9.vadr.dcr
#
# All output files created in directory ./va-noro.9/
#
# Elapsed time: 00:01:14.88
# hh:mm:ss
#
[ok]
All of these files were created in the newly created directory
va-noro.9
. What follows is a brief discussion of some of these
output file types. More information on these files and their formats
can be found here.
The first three files are the .log
file, which is
the same as the standard output printed to the screen currently being
discussed, the .cmd
file, and the .filelist
file which lists the output files created by
v-annotate.pl
. Next comes a .seqstat
file
with lengths for each sequence in the input file.
v-annotate.pl
also creates
5-column tab-delimited feature table files that end with the suffix
.tbl
. There is a separate file for passing
(va-noro9.vadr.pass.tbl
) and failing (va-noro9.vadr.fail.tbl
)
sequences.
The format of the .tbl
files is described here:
https://www.ncbi.nlm.nih.gov/Sequin/table.html.
These files contain some of the same information as the .ftr
files
discussed above, with some additional information on qualifiers for
features read from the model info file.
From the va-noro9.vadr.pass.tbl
file, the feature table for
KY887602
is:
>Feature KY887602.1
<1 5083 gene
gene ORF1
<1 5083 CDS
product nonstructural polyprotein
codon_start 2
protein_id KY887602.1_1
<1 979 mat_peptide
product p48
protein_id KY887602.1_1
980 2077 mat_peptide
product NTPase
protein_id KY887602.1_1
2078 2608 mat_peptide
product p22
protein_id KY887602.1_1
2609 3007 mat_peptide
product VPg
protein_id KY887602.1_1
3008 3550 mat_peptide
product Pro
protein_id KY887602.1_1
3551 5080 mat_peptide
product RdRp
protein_id KY887602.1_1
5064 6686 gene
gene ORF2
5064 6686 CDS
product VP1
protein_id KY887602.1_2
6686 7492 gene
gene ORF3
6686 7492 CDS
product VP2
protein_id KY887602.1_3
In the .fail.tbl
file, each sequence's feature table contains a
special section at the end of the table that lists errors
due to reported fatal alerts. For example, at the end of
va-noro9.vadr.fail.tbl
:
Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:VF1) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]; seq-coords:5275..5277:+; mdl-coords:5300..5302:+; mdl:NC_008311;
ERROR: CDS_HAS_STOP_CODON: (CDS:VF1) stop codon in protein-based alignment [-]; seq-coords:5275..5277:+; mdl-coords:5300..5302:+; mdl:NC_008311;
ERROR: INDEFINITE_ANNOTATION_END: (CDS:VF1) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]; seq-coords:5650..5685:+; mdl-coords:5710..5710:+; mdl:NC_008311;
ERROR: INDEFINITE_ANNOTATION_START: (CDS:VP2) protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]; seq-coords:6656..6709:+; mdl-coords:6681..6681:+; mdl:NC_008311;
Note that this file lists only four ERRORs while the .alt
output
file below lists five alerts. Not all fatal alerts will be
printed to this .fail.tbl
file, because when specific pairs of
alerts occur, only one is output to reduce the number of overlapping
or redundant problems reported to the submitter/user. In this case the
mutendcd
(MUTATION_AT_END
) alert is omitted in the .fail.tbl
file because it occurs in combination with a cdsstopn
(CDS_HAS_STOP_CODON
) alert. But this is rare, as only one alert
(mutendcd
) can possibly be omitted. See the
far right column of the this table for which alerts,
when present in combination with mutendcd
causes it
to be omitted. Only alerts output to the .fail.tbl
table are output to the .alt.list
file.
The next three output files all end in .list
. The .pass.list
and
.fail.list
files are simply lists of all passing and failing
sequences, respectively, with each sequence listed on a separate
line. The .alt.list
file lists all alerts that cause errors in the
.fail.tbl
file in a four field tab-delimited format described
here. The va-noro9.vadr.alt.list
file lists
the four alerts/errors in the .fail.tbl
file shown above:
#sequence model feature-type feature-name error seq-coords mdl-coords error-description
JN975492.1 NC_008311 CDS VF1 CDS_HAS_STOP_CODON 5275..5277:+ 5300..5302:+ in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]
JN975492.1 NC_008311 CDS VF1 CDS_HAS_STOP_CODON 5275..5277:+ 5300..5302:+ stop codon in protein-based alignment [-]
JN975492.1 NC_008311 CDS VF1 INDEFINITE_ANNOTATION_END 5650..5685:+ 5710..5710:+ protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]
JN975492.1 NC_008311 CDS VP2 INDEFINITE_ANNOTATION_START 6656..6709:+ 6681..6681:+ protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]
After that are two FASTA-formatted sequence files.
One of these files includes all
passing sequences (va-noro.9.vadr.pass.fa
) and the other includes all
failing sequences (va-noro.9.vadr.fail.fa
).
After these two FASTA files are eight tabular summary files that end with three letter suffixes:
suffix | description | reference |
---|---|---|
.alc |
per-alert code information (counts) | description of format |
.alt |
per-alert instance information | description of format |
.ftr |
per-feature information | description of format |
.mdl |
per-model information | description of format |
.sgm |
per-segment information | description of format |
.sqa |
per-sequence annotation information | description of format |
.sqc |
per-sequence classification information | description of format |
.dcr |
alignment doctoring information | description of format |
The contents of the .mdl
and .alc
files were already output by
v-annotate.pl
as covered above. To get more information on each
sequence, see the .sqa
and .sqc
files. The .sqc
file
(va-noro.9.vadr.sqc
) includes
information on the classification of each sequence:
#seq seq seq sub seq mdl num sub score diff/ seq
#idx name len p/f ant model1 grp1 grp1 score sc/nt cov cov bias hits str model2 grp2 grp2 diff nt alerts
#--- ---------- ---- ---- --- --------- --------- ---- ------ ----- ----- ----- ---- ---- --- --------- --------- ----- ------ ----- ------
1 KY887602.1 7547 PASS yes NC_039477 Norovirus GII 8142.8 1.079 1.000 0.997 12.5 1 + NC_044046 Norovirus GVIII 4348.2 0.576 -
2 KT818729.1 243 PASS yes NC_044854 Norovirus GI 170.4 0.701 0.996 0.031 0 1 + NC_044932 Norovirus GII 90.0 0.370 -
3 EU437710.1 291 PASS yes NC_001959 Norovirus GI 249.4 0.857 1.000 0.038 0 1 + NC_044855 Norovirus GIV 161.5 0.555 -
4 DQ288307.1 1094 PASS yes NC_029645 Norovirus GIII 973.4 0.890 1.000 0.150 6.2 1 + NC_001959 Norovirus GI 671.1 0.613 -
5 AY237464.1 255 PASS yes NC_039477 Norovirus GII 221.1 0.867 1.000 0.034 0 1 + NC_044047 Norovirus GVII 113.5 0.445 -
6 KF475958.1 275 PASS yes NC_044854 Norovirus GI 248.0 0.902 1.000 0.036 0 1 + NC_044932 Norovirus GII 164.0 0.596 -
7 AB713840.1 347 PASS yes NC_008311 Norovirus GV 330.6 0.953 1.000 0.047 0.2 1 + NC_040876 Norovirus GII 239.5 0.690 -
8 JN585032.1 286 PASS yes NC_029645 Norovirus GIII 242.3 0.847 0.997 0.039 0.1 1 + NC_039897 Norovirus GI 154.8 0.541 -
9 JN975492.1 7286 FAIL yes NC_008311 Norovirus GV 4666.2 0.640 1.000 0.987 16.8 1 + NC_044047 Norovirus GVII 3382.8 0.464 -
This file includes per-sequence information on whether each sequence
passed or failed, its best and second-best matching model, and scores
and coverage. The difference in score between the best and second-best
model gives an indication of how confidently classified the sequence
is to the best-matching model. (The qstgroup
and qstsbgrp
alerts
are reported if these scores are not sufficiently far apart to alert
the user that a sequence is not confidently classified.) Note that the
sequence JN975492.1
is the only sequence that failed. It has no seq alerts
listed in the final field, so the fatal alerts that caused it
to fail must have been per-feature alerts. These can be seen in the
.ftr
and .alt
files. An explanation of all fields in the .sqc
file type is here.
Next, take a look at the first few lines of the .ftr
file (va-noro.9.vadr.ftr
):
# seq seq ftr ftr ftr ftr par seq model ftr
#idx name len p/f model type name len idx idx str n_from n_to n_instp trc 5'N 3'N p_from p_to p_instp p_sc nsa nsn coords coords alerts
#--- ---------- ---- ---- --------- ----------- ------------------------- ---- --- --- --- ------ ---- ------- ----- --- --- ------ ---- ---------------- ---- --- --- ------------ ------------ ------
1.1 KY887602.1 7547 PASS NC_039477 gene ORF1 5083 1 -1 + 1 5083 - 5' 0 0 - - - - 1 0 1..5083:+ 22..5104:+ -
1.2 KY887602.1 7547 PASS NC_039477 CDS nonstructural_polyprotein 5083 2 -1 + 1 5083 - 5' 0 0 2 5080 - 9094 1 0 1..5083:+ 22..5104:+ -
1.3 KY887602.1 7547 PASS NC_039477 gene ORF2 1623 3 -1 + 5064 6686 - no 0 0 - - - - 1 0 5064..6686:+ 5085..6707:+ -
1.4 KY887602.1 7547 PASS NC_039477 CDS VP1 1623 4 -1 + 5064 6686 - no 0 0 5064 6683 - 2878 1 0 5064..6686:+ 5085..6707:+ -
1.5 KY887602.1 7547 PASS NC_039477 gene ORF3 807 5 -1 + 6686 7492 - no 0 0 - - - - 1 0 6686..7492:+ 6707..7513:+ -
1.6 KY887602.1 7547 PASS NC_039477 CDS VP2 807 6 -1 + 6686 7492 - no 0 0 6686 7486 - 1393 1 0 6686..7492:+ 6707..7513:+ -
1.7 KY887602.1 7547 PASS NC_039477 mat_peptide p48 979 7 2 + 1 979 - 5' 0 0 - - - - 1 0 1..979:+ 22..1000:+ -
1.8 KY887602.1 7547 PASS NC_039477 mat_peptide NTPase 1098 8 2 + 980 2077 - no 0 0 - - - - 1 0 980..2077:+ 1001..2098:+ -
1.9 KY887602.1 7547 PASS NC_039477 mat_peptide p22 531 9 2 + 2078 2608 - no 0 0 - - - - 1 0 2078..2608:+ 2099..2629:+ -
1.10 KY887602.1 7547 PASS NC_039477 mat_peptide VPg 399 10 2 + 2609 3007 - no 0 0 - - - - 1 0 2609..3007:+ 2630..3028:+ -
1.11 KY887602.1 7547 PASS NC_039477 mat_peptide Pro 543 11 2 + 3008 3550 - no 0 0 - - - - 1 0 3008..3550:+ 3029..3571:+ -
1.12 KY887602.1 7547 PASS NC_039477 mat_peptide RdRp 1530 12 2 + 3551 5080 - no 0 0 - - - - 1 0 3551..5080:+ 3572..5101:+ -
#
This file includes information on each annotated feature, organized by
sequence. The lines above show the annotated features for the first
sequence KY887602.1
and include information on the feature length,
strand, positions in the sequence and in the reference model, whether
it is truncated or not, and where the blastx protein validation stage
predicted coordinates are. The seq coords
and mdl coords
lines
near the end show the sequence and model coordinates of each feature
in the VADR coordinate string format, described
here.
The final column lists any alerts
pertaining to each feature. For the features in this sequence there
are no alerts, but for the final sequence, some alerts are listed for the
second CDS. An explanation of all fields in the .ftr
file type is here.
Another important file is the .alt
output file (va-noro9.vadr.alt
)
which includes one line per alert reported:
# seq ftr ftr ftr alert alert seq seq mdl mdl alert
#idx name model type name idx code fail description coords len coords len detail
#---- ---------- --------- ---- ---- --- -------- ---- --------------------------- ------------ --- ------------ --- ------
9.1.1 JN975492.1 NC_008311 CDS VF1 6 mutendcd yes MUTATION_AT_END 5683..5685:+ 3 5708..5710:+ 3 expected stop codon could not be identified, predicted CDS stop by homology is invalid [TCA]
9.1.2 JN975492.1 NC_008311 CDS VF1 6 cdsstopn yes CDS_HAS_STOP_CODON 5275..5277:+ 3 5300..5302:+ 3 in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:408,M:408]
9.1.3 JN975492.1 NC_008311 CDS VF1 6 cdsstopp yes CDS_HAS_STOP_CODON 5275..5277:+ 3 5300..5302:+ 3 stop codon in protein-based alignment [-]
9.1.4 JN975492.1 NC_008311 CDS VF1 6 indf3pst yes INDEFINITE_ANNOTATION_END 5650..5685:+ 36 5710..5710:+ 1 protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [36>5, no valid stop codon in nucleotide-based prediction]
9.2.1 JN975492.1 NC_008311 CDS VP2 8 indf5pst yes INDEFINITE_ANNOTATION_START 6656..6709:+ 54 6681..6681:+ 1 protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint [54>5]
All alerts are for the JN975492.1
sequence. Four are for the VF1 CDS
and 1 is for the VP2 CDS. The alert codes are listed in the seventh
column, along with a brief description in the eigth column.
Then the sequence and model coordinates pertaining to the alert and
the lengths of those regions are listed in columns 9 to 12.
A more detailed description of the problem can be found in the final column.
All possible alerts are listed in the alert
table.
For some examples of different types of alerts see
here
Example of using the v-annotate.pl
--alt_pass
and --alt_fail
to change alerts from fatal to non-fatal and vice versa
One way to change the behavior of v-annotate.pl
is to change which
alerts are fatal or non-fatal. Most alerts are fatal by default, but
some are not, as shown in the alert table. Some alerts are
always fatal in that they cannot be changed, but all others can be
toggled between fatal or non-fatal using the --alt_pass
and --alt_fail
options.
For example, we can make the JN976492.1
sequence above pass by
making the five observed alerts (mutendcd, cdsstopn, cdsstopp,
indf3pst, and indf5pst) by rerunning the v-annotate.pl
above with this command:
v-annotate.pl --alt_pass mutendcd,cdsstopn,cdsstopp,indf3pst,indf5pst $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-pass-noro.9
To supply multiple alerts with --alt_pass
or --alt_fail
, separate them by a ,
without any whitespace, like above.
The output will look very similar to the earlier run, but the summary information printed at the end will show that no sequences fail this time, despite the same alerts being reported.
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- --------- --------- -------- ---- ---- ----
1 NC_008311 Norovirus GV 2 2 0
2 NC_039477 Norovirus GII 2 2 0
3 NC_044854 Norovirus GI 2 2 0
4 NC_029645 Norovirus GIII 2 2 0
5 NC_001959 Norovirus GI 1 1 0
#--- --------- --------- -------- ---- ---- ----
- *all* - - 9 9 0
- *none* - - 0 0 0
#--- --------- --------- -------- ---- ---- ----
#
# Summary of reported alerts:
#
# alert causes short per num num long
#idx code failure description type cases seqs description
#--- -------- ------- --------------------------- ------- ----- ---- -----------
1 mutendcd no MUTATION_AT_END feature 1 1 expected stop codon could not be identified, predicted CDS stop by homology is invalid
2 cdsstopn no CDS_HAS_STOP_CODON feature 1 1 in-frame stop codon exists 5' of stop position predicted by homology to reference
3 cdsstopp no CDS_HAS_STOP_CODON feature 1 1 stop codon in protein-based alignment
4 indf5pst no INDEFINITE_ANNOTATION_START feature 1 1 protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint
5 indf3pst no INDEFINITE_ANNOTATION_END feature 1 1 protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint
#--- -------- ------- --------------------------- ------- ----- ---- -----------
The --alt_fail <s>
option works the same way as alt_pass <s>
but alert codes
in <s>
should be non-fatal by default.
Alternatively, if you want to relax the stringency of alerts only for
specific features you can do that by modifying the modelinfo
input
file as explained below.
The most time-consuming stages of v-annotate.pl
(classification,
coverage determination and alignment) can be parallelized on a cluster
by splitting up the input sequence file randomly into multiple files,
and running each as a separate job. This is most beneficial for large
input sequence files. Parallel mode is invoked with the -p
option.
By default, v-annotate.pl
will consult the file
$VADRSCRIPTSDIR/vadr.qsubinfo
to read the command prefix and suffix
for submitting jobs to the cluster. This file is set up to use Univa
Grid Engine (UGE 8.5.5) and specific flags used on the NCBI system,
but you can either modify this file to work with your own cluster or
create a new file <s>
and use the option -q <s>
to read that file.
The $VADRSCRIPTSDIR/vadr.qsubinfo
has comments at the top that
explain the format of the file. Email eric.nawrocki@nih.gov for help.
To repeat the above v-annotate.pl
run in parallel mode, use this command:
v-annotate.pl -p $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-parallel-noro.9
The output will look very similar to the run without -p
, but with additional lines of
output explaining that jobs have been submitted and are running on the compute farm:
# Submitting 1 cmscan classification job(s) to the farm ...
# Waiting a maximum of 500 minutes for all farm jobs to finish ...
# 0 of 1 jobs finished (0.2 minutes spent waiting)
# 0 of 1 jobs finished (0.5 minutes spent waiting)
# 0 of 1 jobs finished (0.8 minutes spent waiting)
# 0 of 1 jobs finished (1.0 minutes spent waiting)
# 1 of 1 jobs finished (1.2 minutes spent waiting)
# done. [ 75.7 seconds]
# Submitting 1 cmsearch coverage determination job(s) (NC_001959: 1 seqs) to the farm ...
# Waiting a maximum of 500 minutes for all farm jobs to finish ...
# 1 of 1 jobs finished (0.2 minutes spent waiting)
# done. [ 15.2 seconds]
Usage of -p
will not affect the output of v-annotate.pl
other than
these lines about the status of jobs, but it can make processing of
large sequence files significantly faster depending on how busy the
cluster is.
To get a list of command-line options, execute:
v-annotate.pl -h
This will output the usage and available command-line options.
Each option has a short description, but additional information on some
of these options can be found below.
For v-annotate.pl
the available options are split into nine different categories,
each explained in their own subsection below.
In the tables describing options below, <s>
represents a string,
<x>
indicates a floating point number and <n>
represents an
integer.
......option.... | explanation |
---|---|
-f |
if <output directory> already exists, then using this option will cause it to be overwritten, otherwise the progam exits in error |
-v |
verbose mode: all commands will be output to standard output as they are run |
--atgonly |
only consider ATG as a valid start codon, regardless of model's translation table |
--minpvlen <n> |
set the minimum length in nucleotides for CDS/mat_peptide/gene features to be output to feature tables and for protein validation analysis to <n> , default <n> is 30 |
--nkb <n> |
set the target number of Kb of sequence for each alignment job and/or chunk (with --split) to <n> Kb (thousand nucleotides), default <n> is 300 |
--keep |
keep additional output files that are normally removed |
..........option.......... | explanation |
---|---|
--group <s> |
specify that the expected classification of all sequences is group <s> , sequences determined to not be in this group will trigger an incgroup alert |
--subgroup <s2> |
specify that the expected classification of all sequences is subgroup <s> within group <s2> from --group <s2> , sequences determined to not be in this group will trigger an incsubgrp alert; requires --group |
............option............ | explanation |
---|---|
--alt_list |
output summary of all alerts and then exit |
--alt_pass <s> |
specify that alert codes in comma-separated string <s> are non-fatal (do not cause a sequence to fail), all alert codes listed must be fatal by default |
--alt_fail <s> |
specify that alert codes in comma-separated string <s> are fatal (cause a sequence to fail), all alert codes listed must be non-fatal by default |
--alt_mnf_yes <s> |
specify that alert codes in comma-separated string <s> for 'misc_not_failure' features cause misc_feature-ization, not failure as explained more here |
--alt_mnf_no <s> |
specify that alert codes in comma-separated string <s> for 'misc_not_failure' features cause failure, not misc-feature-ization as explained more here |
............option............ | explanation |
---|---|
--ignore_mnf |
ignore non-zero 'misc_not_feature' values in modelinfo file, set to 0 for all features/models |
--ignore_isdel |
ignore non-zero 'is_deletable' values in modelinfo file, set to 0 for all features/models |
--ignore_afset |
ignore non-zero 'alternative_ftr_set' and 'alternative_ftr_set_subn' values in modelinfo file |
--ignore_afsetsubn |
ignore non-zero 'alternative_ftr_set_subn' values in modelinfo file |
.......option....... | explanation |
---|---|
-m <s> |
use the CM file <s> , instead of the default CM file ($VADRMODELDIR/vadr.cm) |
-a <s> |
use HMM file <s> instead of the default HMM file ($VADRMODELDIR/vadr.hmm) |
-i <s> |
use the VADR model info file <s> , instead of the default model info file ($VADRMODELDIR/vadr.minfo) |
-n <s> |
use the blastn DB file <s> when necessary, instead of the default blastn DB file ($VADRMODELDIR/vadr.fa), only used if -s or -r is also used |
-x <s> |
specify that the blastx database files to use for protein validation are in dir <s> , instead of the default directory ($VADRMODELDIR) |
--mkey <s> |
specify that .cm, .minfo, and blastn .fa files in $VADRMODELDIR start with key <s> , not 'vadr' |
--mdir <s> |
specify that all model files to use are in the directory <s> , not in $VADRMODELDIR |
--mlist <s> |
specify that only the subset of models listed in the file <s> be used |
In the table below, <n>
represents a positive interger argument and
<x>
represents a positive floating-point argument.
...........option........... | relevant alert code(s) | relevant error(s) | default value that triggers alert | explanation |
---|---|---|---|---|
--lowsc <x> |
lowscore | LOW_SCORE | < 0.3 | set bits per nt threshold for alert to <x> |
--indefclass <x> |
indfclas | INDEFINITE_CLASSIFICATION | < 0.03 | set bits per nt difference threshold for alert between top two models (not in same subgroup) to <x> |
--incspec <x> |
incgroup, incsubgrp | INCORRECT_SPECIFIED_GROUP, INCORRECT_SPECIFIED_SUBGROUP | < 0.2 | set bits per nt difference threshold for alert between best-matching model <m> and highest-scoring model in specified group <s1> (from --group <s1> ) or subgroup <s2> (from --subgroup <s2> ), where <m> is not in group/subgroup <s1> /<s2> to <x> |
--lowcov <x> |
lowcovrg | LOW_COVERAGE | < 0.9 | set fractional coverage threshold for alert to <x> |
--dupregolp <n> |
dupregin | DUPLICATE_REGIONS | >= 20 | set min number of model position overlap for alert to <n> positions |
--dupregsc <x> |
dupregin | DUPLICATE_REGIONS | >= 10.0 | set min bit score of weaker overlapping hit to <x> bits |
--indefstr <x> |
indfstrn | INDEFINITE_STRAND | >= 25.0 | set bit score of weaker strand hit for alert to <x> |
--lowsim5seq <n> |
lowsim5s | LOW_SIMILARITY_START | >= 15 | set length (nt) threshold for alert to <n> |
--lowsim3seq <n> |
lowsim3s | LOW_SIMILARITY_END | >= 15 | set length (nt) threshold for alert to <n> |
--lowsimiseq <n> |
lowsimis | LOW_SIMILARITY | >= 1 | set length (nt) threshold for alert to <n> |
--lowsim5ftr <n> |
lowsim5f | LOW_FEATURE_SIMILARITY_START | >= 5 | set length (nt) threshold for alert to <n> |
--lowsim3ftr <n> |
lowsim3f | LOW_FEATURE_SIMILARITY_END | >= 5 | set length (nt) threshold for alert to <n> |
--lowsimiftr <n> |
lowsimif | LOW_FEATURE_SIMILARITY | >= 1 | set length (nt) threshold for alert to <n> |
--biasfrac <x> |
biasdseq | BIASED_SEQUENCE | >= 0.25 | set fractional bit score threshold for biased score/total score for alert to <x> |
--nmiscftrthr <n> |
nmiscftr | TOO_MANY_MISC_FEATURES | >= 4 | set minimum number of misc_features per sequence for alert to <n> |
--indefann <x> |
indf5lcc, indf5lcn, indf3lcc, indf3lcn | INDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END | < 0.8 | set posterior probability threshold for non-mat_peptide features for alert to <x> |
--indefann_mp <x> |
indf5lcc, indf5lcn, indf3lcc, indf3lcn | INDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END | < 0.6 | set posterior probability threshold for mat_peptide features for alert to <x> |
--fstminntt <n> |
fsthicft, fstlocft, fstukct5 | POSSIBLE_FRAMESHIFT_HIGH_CONF, POSSIBLE_FRAMESHIFT_LO_CONF, POSSIBLE_FRAMESHIFT | >= 4 | set maximum allowed length of aligned region in different frame in which frame is not restored before CDS end to <n> |
--fstminnti <n> |
fsthicfi, fstlocfi, fstukcfi | POSSIBLE_FRAMESHIFT_HIGH_CONF, POSSIBLE_FRAMESHIFT_LO_CONF, POSSIBLE_FRAMESHIFT | >= 6 | set maximum allowed length of aligned region in different frame in which frame is restored before CDS end to <n> |
--fsthighthr <x> |
fsthicnf | POSSIBLE_FRAMESHIFT_HIGH_CONF | >= 0.8 | set average posterior probability threshold for potentially frameshifted region for high confidence alert to <x> |
--fstlowthr <x> |
fstlocnf | POSSIBLE_FRAMESHIFT_LOW_CONF | >= 0.0 | set average posterior probability threshold for potentially frameshifted region for low confidence alert to <x> |
--xalntol <n> |
indf5pst, indf3pst | INDEFINITE_ANNOTATION_START, INDEFINITE_ANNOTATION_END | > 5 | set maximum allowed difference in nucleotides between predicted blastx and CM start/end without alert to <n> (blastx coordinates must be internal to CM coordinates) |
--xmaxins <n> |
insertnp | INSERTION_OF_NT | > 27 | set maximum allowed nucleotide insertion length in blastx validation alignment without alert to <n> |
--xmaxdel <n> |
deletinp | DELETION_OF_NT | > 27 | set maximum allowed nucleotide deletion length in blastx validation alignment without alert to <n> |
--nmaxins <n> |
insertnn | INSERTION_OF_NT | > 27 | set maximum allowed nucleotide insertion length in CDS nt alignment without alert to <n> |
--nmaxdel <n> |
deletinn | DELETION_OF_NT | > 27 | set maximum allowed nucleotide deletion length in CDS nt alignment without alert to <n> |
--xlonescore <n> |
indfantp | INDEFINITE_ANNOTATION | >= 80 | set minimum blastx raw score for a lone blastx hit not supported by CM analysis for alert to <n> |
--hlonescore <n> |
indfantp | INDEFINITE_ANNOTATION | >= 10 | set minimum hmmer bit score for a lone hmmsearch hit not supported by CM analysis for alert to <n> |
Several options exist for controlling the command-line options that will be passed
to Infernal's cmalign
program in the alignment stage. For more information on these options and how
they control cmalign
, see the Infernal
User's Guide manual page for cmalign
(section 8 of http://eddylab.org/infernal/Userguide.pdf)
.........option......... | explanation |
---|---|
--mxsize <n> |
set maximum allowed cmalign DP matrix size to <n> Mb before triggering an unexpdivg alert, default <n> is 16000 |
--tau <x> |
set the initial tau (probability loss) value to <x> (sets the cmalign --tau option), default <x> is 0.001 |
--nofixedtau |
do not fix the tau value, allow it to increase if necessary (removes the cmalign --fixedtau option), default is to fix tau with cmalign --fixedtau |
--nosub |
use alternative alignment strategy for truncated sequences (removes the cmalign --sub --notrunc options), default is use sub-CM alignment strategy with cmalign --sub --notrunc |
--noglocal |
run in local mode instead of glocal mode (removes the cmalign -g option), default is to use glocal mode with cmalign -g |
--cmindi |
force cmalign to align one sequence at a time, mainly useful for debugging |
The glsearch
program from the FASTA package
can be used as an alternative to the cmalign
program.
For more information on these options and how they control glsearch
, see the FASTA documentation
(https://fasta.bioch.virginia.edu/wrp_fasta/fasta_guide.pdf).
.........option......... | explanation |
---|---|
--glsearch |
align with glsearch instead of cmalign |
--gls_match <n> |
set glsearch match score to <n> > 0 (-r option in glsearch), default is `5' |
--gls_mismatch <n> |
set glsearch mismatch score to <n> < 0 (-r option in glsearch), default is -3 |
--gls_gapopen <n> |
set glsearch gap open score to <n> < 0 (-f option in glsearch), default is -17 |
--gls_gapextend <n> |
set glsearch gap extend score to <n> < 0 (-g option in glsearch), default is -4 |
Below is a list of options for controlling the blastx protein
validaation stage. Several of these control command-line options that
will be passed to blastx
. For more information on these options and
how they control blastx
, see the NCBI BLAST documentation
(tables C1 and C4 of https://www.ncbi.nlm.nih.gov/books/NBK279684/).
.........option......... | explanation |
---|---|
--xmatrix <s> |
use the substitution matrix <s> (sets the blastx -matrix <s> option), default is to use the default blastx matrix |
--xdrop <n> |
set the xdrop options to <n> (sets the blastx -xdrop_ungap <n> , -xdrop_gap <n> and -xdrop_gap_final <n> with the same <n> ), default is to use default blastx values |
--xnumali <n> |
specify that the top <n> alignments are output by blastx , mostly relevant in combination with --xlongest (sets the blastx -num_alignments <n> option), default <n> is 20 |
--xlongest |
use the longest blastx alignment of those returned (controlled by --xnumali <n> ), default is to use the highest scoring alignment |
Optionally, HMMER's hmmsearch program can be used instead of blastx for the protein validation stage.
CAUTION: This feature is relatively new and untested.
Several of these control command-line options that
will be passed to blastx
. For more information on HMMER, see
the HMMER user's guide (http://eddylab.org/software/hmmer/Userguide.pdf).
......option...... | explanation |
---|---|
--pv_hmmer |
use hmmer instead of blastx for protein validation |
--h_max |
use the --max option with hmmsearch |
--h_minbit <x> |
set the minimum hmmsearch bit score threshold to <x> , the default <x> is -10 . |
The -s
option turns on an acceleration heuristic based on a
first-pass blastn alignment of each input sequence. With -s
,
blastn is used instead of cmscan for sequence classification,
and the largest ungapped alignment region, called the 'seed', is
extracted from the top hit blastn hit, and fixed for the alignment
stage, such that only the sequence before and after the fixed seed is
aligned with cmalign. This option was originally developed for
SARS-CoV-2, for which it offers significant acceleration for many
sequences which are highly similar to the SARS-CoV-2 RefSeq model.
When -s
option is used, an additional output file with suffix .sda
is created,
with format described here.
.........option......... | explanation |
---|---|
-s |
turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment |
--s_blastnws <n> |
for -s , set the blastn -word_size parameter to <n> , the default value for <n> is 7 |
--s_blastnrw <n> |
for -s , set the blastn -reward parameter to <n> , the default value for <n> is 1 |
--s_blastnpn <n> |
for -s , set the blastn -penalty parameter to <n> , the default value for <n> is -2 |
--s_blastngo <n> |
for -s , set the blastn -gapopen parameter to <n> , the default value for <n> is 2 |
--s_blastnge <n> |
for -s , set the blastn -gapextend parameter to <n> , the default value for <n> is 1 |
--s_blastndf |
for -s , do not use -gapopen/-gapextend options with blastn, use default values for gap penalties |
--s_blastnsc <x> |
for -s , set the blastn minimum HSP score to consider to <x> , the default value for <x> is 50.0 |
--s_blastntk |
for -s , set blastn option -task blastn |
--s_blastnxd <n> |
for -s , set the blastn -xdrop_gap_final parameter to <n> , the default value for <n> is 110 |
--s_minsgmlen <n> |
for -s , set minimum length of ungapped region in HSP seed to <n> , the default value for <n> is 10 |
--s_allsgm |
for -s , keep full HSP as seed, do not enforce a minimum segment length |
--s_ungapsgm |
for -s , only keep max length ungapped segment of HSP, this was default behavior for vadr v1.1 to v1.3 |
--s_startstop |
for -s , allow seed to include gaps in start/stop codons |
--s_overhang <n> |
for -s , set the length, in nt, of overlap between the 5' and 3' regions that are aligned with cmalign and the seed region to <n> , the default value for <n> is 100 |
The -r
option adds a pre-processing step to v-annotate.pl
in which
stretches of Ns are identified in each sequence and replaced with the
expected nucleotides at the corresponding positions, when possible.
This can sidestep problems with annotation that the Ns would normally
cause. However, this option should be used with caution because it is
based on the assumption that the missing regions match exactly to the
expected nucleotide sequence that correspond to those missing regions.
Regions of Ns are identified using blastn and examining regions between hits for content of Ns. Ns in regions that satisfy the following three criteria are then replaced with the expected nucleotide at each corresponding position:
-
missing sequence region must be at least 5 nt (controllable with
--r_minlen
option) -
length of missing sequence region must equal length of missing model region
-
missing sequence region must be
>= 0.25
fraction Ns if it includes the 5' end or 3' end of the sequence, or>= 0.50
fraction Ns if it does not (controllable with--r_minfract5
,--r_minfract3
and--r_minfracti
options).
Additionally, as of v1.4, regions for which the length of the missing sequence region and missing model region are not identical are also potentially replaced if the following criteria are met:
-
length of missing model region is
10
nt or less longer than the length of missing sequence region (controllable with--r_diffmaxdel
option) OR length of missing model region is10
nt or less shorter than the length of missing sequence region (controllable with--r_diffmaxins
option) -
at least
1
of the nt in the missing sequence region is not an N (controllable with--r_diffminnonn
option) -
fraction of non-N nt in sequence region that match expected nt after "aligning" sequence region by flushing left or right with respect to differently length model region is at least
0.75
(controllable with--r_diffminfract
option) -
--r_diffno
option is not used
When -r
is used, an additional output file with suffix .rpn
is created,
with format described here.
...........option........... | explanation |
---|---|
-r |
turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible |
--r_minlen <n> |
for -r , set minimum length subsequence to possibly replace Ns in to <n> , the default value for <n> is 5 |
--r_minfract5 <f> |
for -r , set the minimum fraction of nucleotides in a subsequence at the 5' end to trigger N replacement to <x> , the default value for <x> is 0.25 |
--r_minfract3 <f> |
for -r , set the minimum fraction of nucleotides in a subsequence at the 3' end to trigger N replacement to <x> , the default value for <x> is 0.25 |
--r_minfracti <f> |
for -r , set the minimum fraction of nucleotides in an internal subsequence to trigger N replacement to <x> , the default value for <x> is 0.5 |
--r_diffno |
do not try replacement of N rich regions if sequence and model regions are of different lengths, the default is to try if criteria defined by other --r_diff* options are met |
--r_diffmaxdel |
maxium allowed length difference b/t sequence and model regions (when model length > sequence length) to try replacement is <n> nt, the default value for <n> is 10 |
--r_diffmaxins |
maxium allowed length difference b/t sequence and model regions (when sequence length > model length) to try replacement is <n> nt, the default value for <n> is 10 |
--r_diffminnonn |
minimum number of non-N nts in replacement region when model and sequence region are different lengths to try replacement is <n> , the default value for <n> is 1 |
--r_diffminfract |
minimum allowed fraction of non-N nts that must match expected nt from reference model in replacement region when model and sequence region are different lengths is <f> , the default value for <f> is 0.75 |
--r_fetchr |
for -r , fetch features to fasta files from sequences with Ns replaced, instead of original input sequences without Ns replaced |
--r_cdsmpr |
for -r , identify CDS- and mat_peptide-specific alerts using subsequences fetched from sequences with Ns replaced, instead of original input sequences without Ns replaced |
--r_pvorig |
for -r , use original input sequences without Ns replaced in protein validation stage, instead of sequences with Ns replaced |
--r_prof |
for -r , use slower profile methods, not blastn, to identify Ns to replaced |
--r_list |
for -r , only use models listed in file <s> for N replacement stage |
--r_only <s> |
for -r , only use model named <s> for N replacement stage |
--r_blastnws <n> |
for -r , set the blastn -word_size parameter to <n> , the default value for <n> is 7 |
--r_blastnrw <n> |
for -r , set the blastn -reward parameter to <n> , the default value for <n> is 1 |
--r_blastnpn <n> |
for -r , set the blastn -penalty parameter to <n> , the default value for <n> is -2 |
--r_blastngo <n> |
for -r , set the blastn -gapopen parameter to <n> , the default value for <n> is 2 |
--r_blastnge <n> |
for -r , set the blastn -gapextend parameter to <n> , the default value for <n> is 1 |
--r_blastndf |
for -r , do not use -gapopen/-gapextend options with blastn, use default values for gap penalties |
--r_blastnsc <x> |
for -r , set the blastn minimum HSP score to consider to <x> , the default value for <x> is 50.0 |
--r_blastntk |
for -r , set blastn option -task blastn |
--r_blastnxd <n> |
for -r , set the blastn -xdrop_gap_final parameter to <n> , the default value for <n> is 110 |
v-annotate.pl
options related to splitting input sequence file into chunks and processing each chunk separately and potentially in parallel
The --split
option specifies that v-annotate.pl
should split up the input file into chunks and
processing each chunk separately and then combining results at the end after all chunks have been processed.
This limits total memory usage for large input sequence files as
explained more here.
........option........ | explanation |
---|---|
--split |
split input fasta sequence file into chunks of <n> Kb where <n> is from --nkb <n> (300 Kb, by default) and run each chunk separately |
--cpu <n> |
with --split or --glsearch, parallelize across <n> CPU threads/workers (requires --split oor --glsearch) |
--sidx <n> |
start sequence indexing at <n> for output files, not intended to be set by user except when debugging |
........option........ | explanation |
---|---|
-p |
run in parallel mode so that classification, and each per-model coverage determination and alignment step is split into multiple jobs and run in parallel on a cluster |
-q <s> |
read cluster information file from file <s> instead of from the default file $VADRSCRIPTSDIR/vadr.qsubinfo |
--errcheck |
consider any output to STDERR from a parallel job as an indication the job has failed, this will cause v-annotate.pl to exit, default is to ignore output to STDERR |
........option........ | explanation |
---|---|
--wait <n> |
set the total number of minutes to wait for all jobs to finish at each stage to <n> , if any job is not finished this many minutes after being submitted (as indicated by the existence of an expected output file) then v-annotate.pl will exit in error, default <n> is 500 |
--maxnjobs <n> |
set the maximum number of jobs at each stage to <n> , default <n> is 2500 |
......option...... | explanation |
---|---|
--pv_skip |
do not perform protein validation stage for CDS |
--align_skip |
skip the cmalign stage, use results from previous run, this is mostly useful for debugging purposes |
--val_only |
validate CM and other input files and exit |
.......option....... | explanation |
---|---|
--out_stk |
create additional per-model output stockholm alignments with .stk suffix |
--out_afa |
create additional per-model output aligned fasta alignments with .afa suffix |
--out_rpstk |
with -r , create additional per-model output stockholm alignments with sequences with Ns replaced with .rpstk suffix |
--out_rpafa |
create additional per-model output aligned fasta alignments with sequences with Ns replaced with .rpafa suffix |
--out_fsstk |
output frameshift stockholm alignment files with .frameshift.stk suffix |
--out_allfasta |
output fasta files of predicted features |
--out_nofasta |
minimize total size of output; do not output fasta files of all passing and all failing sequences |
--out_debug |
create additional output files with information on various data structures |
.........option......... | explanation |
---|---|
--execname <s> |
in banner and usage output, replace v-annotate.pl with <s> |
--alicheck |
for debugging purposes, check aligned sequence versus input sequence for identity |
--noseqnamemax |
do not enforce the GenBank maximum length of 50 characters for sequence names |
--minbit <x> |
set minimum cmsearch/cmscan bit score threshold to <x> , the default value for <x> is -10 |
--origfa |
do not copy the input fasta file into output directory prior to analysis, use the original |
--msub <s> |
specify that file <s> lists models to substitute, each line should contain two space-delimited tokens, model listed in token 2 will substitute as best-matching model for all sequences classified as the model listed in token 1 |
--xsub <s> |
specify that file <s> lists blastx dbs to substitute, each line should contain two space-delimited tokens, blastx db for model listed in token 2 will substitute as blastx db for all sequences classified as the model listed in token 1 |
--nodcr |
never doctor alignments to shift gaps to correct start/stop codon annotation |
--forcedcrins |
force insert type alignment doctoring, requires --cmindi , mainly useful for debugging/testing |
--xnoid |
ignore blastx hits that are full length and 100% identical, mainly useful for testing |
To see a table with information on alerts, use the --alt_list
option, like this:
v-annotate.pl --alt_list
The table below contains the same information as in the --alt_list
output,
with sequences organized according to whether they are fatal or not.
Always fatal alert codes are always fatal and cannot be changed using the
--alt_pass
options. All other alert codes can be changed from fatal to non-fatal
by using the --alt_pass
option, or from non-fatal to fatal using the --alt_fail
option.
An example is included below.
In the table below, the type column reports if each alert pertains to an entire
sequence
or a specific annotated feature
within a sequence. The
causes misc_feature
, not failure (if in
modelinfo file shows which alerts are not fatal for expendable
features as described more below.
alert code | type | causes misc_feature , not failure (if in modelinfo file) |
short description/error name | long description |
---|---|---|---|---|
noannotn | sequence | never | NO_ANNOTATION | no significant similarity detected |
revcompl | sequence | never | REVCOMPLEM | sequence appears to be reverse complemented |
unexdivg | sequence | never | UNEXPECTED_DIVERGENCE | sequence is too divergent to confidently assign nucleotide-based annotation |
noftrann | sequence | never | NO_FEATURES_ANNOTATED | sequence similarity to homology model does not overlap with any features |
noftrant | sequence | never | NO_FEATURES_ANNOTATED | all annotated features are too short to be output to feature table |
ftskipfl | sequence | never | UNREPORTED_FEATURE_PROBLEM | only fatal alerts are for feature(s) not output to feature table |
alert code | type | causes misc_feature , not failure (if in modelinfo file) |
short description/error name | long description |
---|---|---|---|---|
incsbgrp | sequence | never | INCORRECT_SPECIFIED_SUBGROUP | score difference too large between best overall model and best specified subgroup model |
incgroup | sequence | never | INCORRECT_SPECIFIED_GROUP | score difference too large between best overall model and best specified group model |
lowcovrg | sequence | never | LOW_COVERAGE | low sequence fraction with significant similarity to homology model |
dupregin | sequence | never | DUPLICATE_REGIONS | similarity to a model region occurs more than once |
discontn | sequence | never | DISCONTINUOUS_SIMILARITY | not all hits are in the same order in the sequence and the homology model |
indfstrn | sequence | never | INDEFINITE_STRAND | significant similarity detected on both strands |
lowsim5s | sequence | never | LOW_SIMILARITY_START | significant similarity not detected at 5' end of the sequence |
lowsim3s | sequence | never | LOW_SIMILARITY_END | significant similarity not detected at 3' end of the sequence |
lowsimis | sequence | never | LOW_SIMILARITY | internal region without significant similarity |
nmiscftr | sequence | never | TOO_MANY_MISC_FEATURES | too many features reported as misc_features |
deletins | sequence | never | DELETION_OF_FEATURE | internal deletion of a complete feature |
mutstart | feature | yes | MUTATION_AT_START | expected start codon could not be identified |
mutendcd | feature | yes | MUTATION_AT_END | expected stop codon could not be identified, predicted CDS stop by homology is invalid |
mutendns | feature | yes | MUTATION_AT_END | expected stop codon could not be identified, no in-frame stop codon exists 3' of predicted valid start codon |
mutendex | feature | yes | MUTATION_AT_END | expected stop codon could not be identified, first in-frame stop codon exists 3' of predicted stop position |
unexleng | feature | yes | UNEXPECTED_LENGTH | length of complete coding (CDS or mat_peptide) feature is not a multiple of 3 |
cdsstopn | feature | yes | CDS_HAS_STOP_CODON | in-frame stop codon exists 5' of stop position predicted by homology to reference |
cdsstopp | feature | yes | CDS_HAS_STOP_CODON | stop codon in protein-based alignment |
fsthicft | feature | yes | POSSIBLE_FRAMESHIFT_HIGH_CONF | high confidence possible frameshift in CDS (frame not restored before end) (not reported if --glsearch |
fsthicfi | feature | yes | POSSIBLE_FRAMESHIFT_HIGH_CONF | high confidence possible frameshift in CDS (frame restored before end) (not reported if --glsearch ) |
fstukcf3 | feature | yes | POSSIBLE_FRAMESHIFT | possible frameshift in CDS (frame not restored before end) (only reported if --glsearch ) |
fstukcfi | feature | yes | POSSIBLE_FRAMESHIFT | possible frameshift in CDS (frame restored before end) (only reported if --glsearch ) |
peptrans | feature | yes | PEPTIDE_TRANSLATION_PROBLEM | mat_peptide may not be translated because its parent CDS has a problem |
pepadjcy | feature | yes | PEPTIDE_ADJACENCY_PROBLEM | predictions of two mat_peptides expected to be adjacent are not adjacent |
indfantp | feature | no | INDEFINITE_ANNOTATION | protein-based search identifies CDS not identified in nucleotide-based search |
indfantn | feature | no | INDEFINITE_ANNOTATION | nucleotide-based search identifies CDS not identified in protein-based search |
indf5gap | feature | yes | INDEFINITE_ANNOTATION_START | alignment to homology model is a gap at 5' boundary |
indf5lcn | feature | yes | INDEFINITE_ANNOTATION_START | alignment to homology model has low confidence at 5' boundary for feature that does not match a CDS |
indf5plg | feature | yes | INDEFINITE_ANNOTATION_START | protein-based alignment extends past nucleotide-based alignment at 5' end |
indf5pst | feature | yes | INDEFINITE_ANNOTATION_START | protein-based alignment does not extend close enough to nucleotide-based alignment 5' endpoint |
indf3gap | feature | yes | INDEFINITE_ANNOTATION_END | alignment to homology model is a gap at 3' boundary |
indf3lcn | feature | yes | INDEFINITE_ANNOTATION_END | alignment to homology model has low confidence at 3' boundary for feature that does not match a CDS |
indf3plg | feature | yes | INDEFINITE_ANNOTATION_END | protein-based alignment extends past nucleotide-based alignment at 3' end |
indf3pst | feature | yes | INDEFINITE_ANNOTATION_END | protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint |
indfstrp | feature | no | INDEFINITE_STRAND | strand mismatch between protein-based and nucleotide-based predictions |
insertnp | feature | no | INSERTION_OF_NT | too large of an insertion in protein-based alignment |
deletinp | feature | yes | DELETION_OF_NT | too large of a deletion in protein-based alignment |
deletinf | feature | no | DELETION_OF_FEATURE_SECTION | internal deletion of a complete section in a multi-section feature with other section(s) annotated |
lowsim5n | feature | no | LOW_FEATURE_SIMILARITY_START | region within annotated feature that does not match a CDS at 5' end of sequence lacks significant similarity |
lowsim3n | feature | no | LOW_FEATURE_SIMILARITY_END | region within annotated feature that does not match a CDS at 3' end of sequence lacks significant similarity |
lowsimin | feature | no | LOW_FEATURE_SIMILARITY | region within annotated feature that does not match a CDS lacks significant similarity |
alert code | type | causes misc_feature , not failure (if in modelinfo file) |
short description/error name | long description |
---|---|---|---|---|
qstsbgrp | sequence | never | QUESTIONABLE_SPECIFIED_SUBGROUP | best overall model is not from specified subgroup |
qstgroup | sequence | never | QUESTIONABLE_SPECIFIED_GROUP | best overall model is not from specified group |
ambgnt5s | sequence | never | AMBIGUITY_AT_START | first nucleotide of the sequence is an ambiguous nucleotide |
ambgnt3s | sequence | never | AMBIGUITY_AT_END | final nucleotide of the sequence is an ambiguous nucleotide |
indfclas | sequence | never | INDEFINITE_CLASSIFICATION | low score difference between best overall model and second best model (not in best model's subgroup) |
lowscore | sequence | never | LOW_SCORE | score to homology model below low threshold |
biasdseq | sequence | never | BIASED_SEQUENCE | high fraction of score attributed to biased sequence composition |
unjoinbl | sequence | never | UNJOINABLE_SUBSEQ_ALIGNMENTS | inconsistent alignment of overlapping region between ungapped seed and flanking region |
deletina | sequence | never | DELETION_OF_FEATURE | allowed internal deletion of a complete feature (feature with is_deletable flag set to 1 in .minfo file) |
ambgntrp | sequence | never | N_RICH_REGION_NOT_REPLACED | N-rich region of unexpected length not replaced during N replacement region (only possibly reported if -r ) |
fstlocft | feature | yes | POSSIBLE_FRAMESHIFT_LOW_CONF | low confidence possible frameshift in CDS (frame not restored before end) (not reported if --glsearch ) |
fstlocfi | feature | yes | POSSIBLE_FRAMESHIFT_LOW_CONF | low confidence possible frameshift in CDS (frame restored before end) (not reported if --glsearch ) |
indf5lcc | feature | yes | INDEFINITE_ANNOTATION_START | alignment to homology model has low confidence at 5' boundary for feature that is or matches a CDS |
indf3lcc | feature | yes | INDEFINITE_ANNOTATION_END | alignment to homology model has low confidence at 3' boundary for feature that is or matches a CDS |
insertnn | feature | no | INSERTION_OF_NT | too large of an insertion in nucleotide-based alignment of CDS feature |
deletinn | feature | yes | DELETION_OF_NT | too large of a deletion in nucleotide-based alignment of CDS feature |
lowsim5c | feature | no | LOW_FEATURE_SIMILARITY_START | region within annotated feature that is or matches a CDS at 5' end of sequence lacks significant similarity |
lowsim3c | feature | no | LOW_FEATURE_SIMILARITY_END | region within annotated feature that is or matches a CDS at 3' end of sequence lacks significant similarity |
lowsimic | feature | no | LOW_FEATURE_SIMILARITY | region within annotated feature that is or matches a CDS lacks significant similarity |
ambgnt5f | feature | no | AMBIGUITY_AT_FEATURE_START | first nucleotide of non-CDS feature is an ambiguous nucleotide |
ambgnt3f | feature | no | AMBIGUITY_AT_FEATURE_END | final nucleotide of non-CDS feature is an ambiguous nucleotide |
ambgnt5c | feature | no | AMBIGUITY_AT_CDS_START | first nucleotide of CDS is an ambiguous nucleotide |
ambgnt3c | feature | no | AMBIGUITY_AT_CDS_END | final nucleotide of CDS is an ambiguous nucleotide |
ambgcd5c | feature | no | AMBIGUITY_IN_START_CODON | 5' complete CDS starts with canonical nt but includes ambiguous nt in its start codon |
ambgcd3c | feature | no | AMBIGUITY_IN_STOP_CODON | 3' complete CDS ends with canonical nt but includes ambiguous nt in its stop codon |
The table below has additional information on the alerts
not contained in the --alt_list
output.
The "relevant_options" column lists command-line options that
pertain to each alert. The [relevant feature types column
shows which feature types each alert can be reported for (this field is "-" for
alerts that pertain to a sequence instead of a feature). The
omitted in .tbl
and .alt.list
by column lists other alerts
that, if present, will cause this alert to be omitted in the .tbl
and .alt.list
files to reduce redundant information reported to
user, this is "-" for alerts that are never omitted from those files.
alert code | short description/error name | relevant options | relevant feature types | omitted in .tbl and .alt.list by |
---|---|---|---|---|
noannotn | NO_ANNOTATION | none | - | - |
revcompl | REVCOMPLEM | none | - | - |
unexdivg | UNEXPECTED_DIVERGENCE | none | - | - |
noftrann | NO_FEATURES_ANNOTATED | none | - | - |
noftrant | NO_FEATURES_ANNOTATED | none | - | - |
ftskipfl | UNREPORTED_FEATURE_PROBLEM | none | - | - |
alert code | short description/error name | relevant_options | relevant feature types | omitted in .tbl and .alt.list by |
---|---|---|---|---|
incsbgrp | INCORRECT_SPECIFIED_SUBGROUP | --incspec |
- | - |
incgroup | INCORRECT_SPECIFIED_GROUP | --incspec |
- | - |
lowcovrg | LOW_COVERAGE | --lowcov |
- | - |
dupregin | DUPLICATE_REGIONS | --dupreg |
- | - |
discontn | DISCONTINUOUS_SIMILARITY | none | - | - |
indfstrn | INDEFINITE_STRAND | --indefstr |
- | - |
lowsim5s | LOW_SIMILARITY_START | --lowsim5seq |
- | - |
lowsim3s | LOW_SIMILARITY_END | --lowsim3seq |
- | - |
lowsimis | LOW_SIMILARITY | --lowsimint |
- | - |
nmiscftr | TOO_MANY_MISC_FEATURES | --nmiscftrthr |
all | - |
deletins | DELETION_OF_FEATURE | none | all | - |
mutstart | MUTATION_AT_START | --atgonly |
CDS | - |
mutendcd | MUTATION_AT_END | none | CDS | cdsstopn, mutendex, mutendns |
mutendns | MUTATION_AT_END | none | CDS | - |
mutendex | MUTATION_AT_END | none | CDS | - |
unexleng | UNEXPECTED_LENGTH | none | CDS, mat_peptide | - |
cdsstopn | CDS_HAS_STOP_CODON | none | CDS | - |
cdsstopp | CDS_HAS_STOP_CODON | none | CDS | - |
fsthicft | POSSIBLE_FRAMESHIFT_HIGH_CONF | --fsthighthr , --fstminntt |
CDS | - |
fsthicfi | POSSIBLE_FRAMESHIFT_HIGH_CONF | --fsthighthr , --fstminnti |
CDS | - |
fstukcft | POSSIBLE_FRAMESHIFT | --glsearch , --fstminntt |
CDS | - |
fstukcfi | POSSIBLE_FRAMESHIFT | --glsearch , --fstminnti |
CDS | - |
peptrans | PEPTIDE_TRANSLATION_PROBLEM | none | mat_peptide | - |
pepadjcy | PEPTIDE_ADJACENCY_PROBLEM | none | mat_peptide | - |
indfantp | INDEFINITE_ANNOTATION | --xlonescore |
CDS | - |
indfantn | INDEFINITE_ANNOTATION | none | CDS | - |
indf5gap | INDEFINITE_ANNOTATION_START | none | all | - |
indf5lcn | INDEFINITE_ANNOTATION_START | --indefann , --indefann_mp |
all except CDS and any gene or mat_peptide with identical start coordinate to a CDS | - |
indf5plg | INDEFINITE_ANNOTATION_START | none | CDS | - |
indf5pst | INDEFINITE_ANNOTATION_START | --xalntol |
CDS | - |
indf3gap | INDEFINITE_ANNOTATION_END | none | all | - |
indf3lcn | INDEFINITE_ANNOTATION_END | --indefann , --indefann_mp |
all except CDS and any gene with identical stop coordinate to CDS | - |
indf3plg | INDEFINITE_ANNOTATION_END | none | CDS | - |
indf3pst | INDEFINITE_ANNOTATION_END | --xalntol |
CDS | - |
indfstrp | INDEFINITE_STRAND | none | CDS | - |
insertnp | INSERTION_OF_NT | --xmaxins |
CDS | - |
insertnn | INSERTION_OF_NT | --nmaxins |
CDS | - |
deletinp | DELETION_OF_NT | --xmaxdel |
CDS | - |
deletinf | DELETION_OF_FEATURE_SECTION | none | all | - |
lowsim5n | LOW_FEATURE_SIMILARITY_START | --lowsim5ftr |
all except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide | - |
lowsim3n | LOW_FEATURE_SIMILARITY_END | --lowsim3ftr |
all except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide | - |
lowsimin | LOW_FEATURE_SIMILARITY | --lowsimiftr |
all except CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide | - |
alert code | short description/error name | relevant_options | relevant feature types | omitted in .tbl and .alt.list by |
---|---|---|---|---|
qstsbgrp | QUESTIONABLE_SPECIFIED_SUBGROUP | none | - | - |
qstgroup | QUESTIONABLE_SPECIFIED_GROUP | none | - | - |
ambgnt5s | AMBIGUITY_AT_START | none | - | - |
ambgnt3s | AMBIGUITY_AT_END | none | - | - |
indfclas | INDEFINITE_CLASSIFICATION | --indefclas |
- | - |
lowscore | LOW_SCORE | --lowsc |
- | - |
biasdseq | BIASED_SEQUENCE | --biasfrac |
- | - |
unjoinbl | UNJOINABLE_SUBSEQ_ALIGNMENTS | none | - | |
deletina | DELETION_OF_FEATURE | --ignore_isdel |
all | - |
ambgntrp | N_RICH_DELETION_OF_FEATURE | --r_diffno , --r_diffmaxdel , --r_diffmaxins , --r_diffminnonn , --r_diffminfract |
all | - |
fstlocft | POSSIBLE_FRAMESHIFT_LOW_CONF | --fstlothr , --fstminntt |
CDS | - |
fstlocfi | POSSIBLE_FRAMESHIFT_LOW_CONF | --fstlothr , --fstminnti |
CDS | - |
indf5lcc | INDEFINITE_ANNOTATION_START | --indefann , --indefann_mp |
CDS and any gene or mat_peptide with identical start coordinate to a CDS | - |
indf3lcc | INDEFINITE_ANNOTATION_END | --indefann , --indefann_mp |
CDS and any gene with identical stop coordinate to CDS | - |
insertnn | INSERTION_OF_NT | --nmaxins |
CDS | - |
deletinn | DELETION_OF_NT | --nmaxdel |
CDS | - |
lowsim5c | LOW_FEATURE_SIMILARITY_START | --lowsim5ftr |
CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide | - |
lowsim3c | LOW_FEATURE_SIMILARITY_END | --lowsim3ftr |
CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide | - |
lowsimic | LOW_FEATURE_SIMILARITY | --lowsimiftr |
CDS, mat_peptide and any feature with identical coordinates to a CDS or mat_peptide | - |
ambgnt5f | AMBIGUITY_AT_FEATURE_START | none | - | - |
ambgnt3f | AMBIGUITY_AT_FEATURE_END | none | - | - |
ambgnt5c | AMBIGUITY_AT_CDS_START | none | CDS | - |
ambgnt3c | AMBIGUITY_AT_CDS_END | none | CDS | - |
ambgcd5c | AMBIGUITY_IN_START_CODON | none | CDS | - |
ambgcd3c | AMBIGUITY_IN_STOP_CODON | none | CDS | - |
It is possible to specify that certain features are expendable and so
have relaxed requirements. Some alerts that are normally fatal are not
fatal for expendable features. If any such alerts are reported for an
expendable feature that feature will be turned into a misc_feature
in the output feature table .pass.tbl
file, but the sequence will
still pass, as long as it has zero fatal alerts for all non-expendable
features and zero fatal sequence alerts.
The default set of specific alerts that an expendable feature can have
without failing its sequence are listed with 'yes' in the 'causes
misc_feature
, not failure (if in modelinfo file)' column in the
tables describing alerts above as well as in the
--alt_list
output. This set can be changed using the --alt_mnf_yes <s1>
option to specify that alert codes in the comma-separated string
<s1>
be added to the set, and the --alt_mnf_no <s2>
option to
specify that alert codes in the comma-separated string <s2>
be
removed from the set.
Expendable features are specified in the .modelinfo
file, with a
key/value pair string: misc_not_feature:"1"
in the FEATURE
line
for the corresponding feature.
For example, the sequence JN975492.1
is the one sequence in the
example above that fails. It matches best to the
NC_008311
model. It fails due to the fatal alerts mutendcd
,
cdsstopn
, cdsstopp
, and indf3pst
for the VF1
CDS feature, and
indf5pst
fatal alert for the VP2
CDS as shown
above. If the VF1
and VP2
features were defined
as expendable using the misc_not_failure:"1"
key/value pair in the
.minfo
file as they are in the included example file vadr.mnf-example.minfo
, then
the sequence would have passed.
The relevant excerpt from the
$VADRSCRIPTSDIR/documentation/annotate-files/vadr.mnf-example.minfo
file:
FEATURE NC_008311 type:"gene" coords:"5069..5710:+" parent_idx_str:"GBNULL" gene:"ORF4" misc_not_failure:"1"
FEATURE NC_008311 type:"CDS" coords:"5069..5710:+" parent_idx_str:"GBNULL" gene:"ORF4" product:"VF1" misc_not_failure:"1"
FEATURE NC_008311 type:"gene" coords:"6681..7307:+" parent_idx_str:"GBNULL" gene:"ORF3" misc_not_failure:"1"
FEATURE NC_008311 type:"CDS" coords:"6681..7307:+" parent_idx_str:"GBNULL" gene:"ORF3" product:"VP2" misc_not_failure:"1"
Note that in addition to the two CDS features, the two gene features that correspond them also have
misc_not_failure:"1"
key/value pairs. When a CDS is made expendable,
it often makes sense to make any corresponding gene features
expendable too. However, gene features are an exception
in that they do not get turned into a misc_feature
if they have
alerts that are normally fatal, as per GenBank convention, but
it is still relevant to mark them as expendable because some alerts
in them will not cause the sequence to fail.
To rerun the example using this new .minfo
file, execute:
v-annotate.pl -i $VADRSCRIPTSDIR/documentation/annotate-files/vadr.mnf-example.minfo $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-mnf-noro.9
The output will indicate that all sequences now pass:
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- --------- --------- -------- ---- ---- ----
1 NC_008311 Norovirus GV 2 2 0
2 NC_029645 Norovirus GIII 2 2 0
3 NC_039477 Norovirus GII 2 2 0
4 NC_044854 Norovirus GI 2 2 0
5 NC_001959 Norovirus GI 1 1 0
#--- --------- --------- -------- ---- ---- ----
- *all* - - 9 9 0
- *none* - - 0 0 0
#--- --------- --------- -------- ---- ---- ----
But the output .pass.tbl
will not include the VF1
and VP2
CDS features for JN975492.1
, instead it will include misc_feature
features with note
qualifiers that indicate the regions are
similar to
their respective CDS:
Relevant excerpt from va-mnf-noro.9.vadr.pass.tbl
:
5044 5685 gene
gene ORF4
5044 5685 misc_feature
note similar to VF1
6656 7282 gene
gene ORF3
6656 7282 misc_feature
note similar to VP2
Note that if only the VF1
CDS or VP2
CDS feature lines included
the misc_not_failure:"1"
key/value pairs in the modelinfo file,
the sequence would have failed.
Two important caveats above expendable features and misc_feature-ization:
-
As mentioned above, features with type
gene
,5'UTR
,3'UTR
oroperon
are never converted tomisc_feature
values as per GenBank convention. -
misc_feature
-ization occurs in.pass.tbl
output files for expendable features as explained above even when the option--nomisc
is used. (The--nomisc
option causesmisc_feature
s not to be reported in.fail.tbl
files.)
The v-annotate.pl
script, in particular the alignment step, is memory intensive.
For Norovirus and Dengue virus, it is recommended to
have 16G of RAM available. For larger viruses, such as the roughly
30Kb SARS-CoV-2 virus, 64G of available RAM is recommended. However,
the --glsearch
and --split
options can be used to reduce the
memory requirements.
The --glsearch
option causes the glsearch
program from the FASTA
software package to be used instead of Infernal's memory intensive
cmalign
program. However, --glsearch
has only been extensively
tested for SARS-CoV-2 sequences, for which it is now recommended due
to the high 64G memory recommendation with cmalign
.
With --glsearch
the amount of required memory is roughly 2G of RAM
for small input fasta files with 2000 sequences or less, but can
exceed 2G for very large input files. Required memory will increase
with the size of the input file.
Using the --split
option removes the dependence of required
memory on input file size as it causes splitting of the input fasta file
into independent chunks with each chunk processed separately and
results from all chunks combined at the end.
Also, in combination with the --glsearch
and --split
options, the
user can specify multi-threading with <n>
CPUs by using the --cpu <n>
option. It is recommended that at least 2G * <n>
total RAM is
available when using this option.
In summary, the following combination of options are recommended to
reduce memory usage and speed-up processing for
SARS-CoV-2 annotation, provided you are running on a machine with 8
available cores and 16G of total RAM: --glsearch --split --cpu 8
.
For more information on SARS-CoV-2 annotation with VADR see https://github.com/ncbi/vadr/wiki/Coronavirus-annotation
Alternatively, if you have access to a cluster and want to parallelize
but do not want to use glsearch
, you can use the -p
option. Importantly, the -p
option will not work with --glsearch
and will not reduce the memory requirements like --glsearch
does.
Using -p
will parallelize the most time-consuming stages of v-annotate.pl
(classification,
coverage determination and alignment) on a cluster
by splitting up the input sequence file randomly into multiple files,
and running each as a separate job. This is most beneficial for large
input sequence files.
With -p
, by default, v-annotate.pl
will consult the file
$VADRSCRIPTSDIR/vadr.qsubinfo
to read the command prefix and suffix
for submitting jobs to the cluster. This file is set up to use Univa
Grid Engine (UGE 8.5.5) and specific flags used on the NCBI system,
but you can either modify this file to work with your own cluster or
create a new file <s>
and use the option -q <s>
to read that file.
The $VADRSCRIPTSDIR/vadr.qsubinfo
has comments at the top that
explain the format of the file. Email eric.nawrocki@nih.gov for help.
To repeat the above v-annotate.pl
run in the example usage section, use this command:
v-annotate.pl -p $VADRSCRIPTSDIR/documentation/annotate-files/noro.9.fa va-parallel-noro.9
Usage of -p
will not affect the output of v-annotate.pl
other than
these lines about the status of jobs, but it can make processing of
large sequence files significantly faster depending on how busy the
cluster is.