Skip to content

Latest commit

 

History

History
438 lines (366 loc) · 24.7 KB

README_FULL.md

File metadata and controls

438 lines (366 loc) · 24.7 KB

GENERATE CUSTOM DATABASE

This file describes the software used to generate the GOTTCHA databases and step-by-step guidiance.


General Introduction

The GOTTCHA databases are generated with the "gottcha_db.pl" program, but requires as input, two pre-formatted Perl Storable-formatted files and an custom XML file describing the sequences to be included. These files are generated by the two included perl scripts, "mkGottchaTaxTree.pl" and "mkGottchaXML.pl". Pre-computed databases in FASTA format are available for both bacteria and viruses at the strain, species, genus, family, order, class, and phylum levels, each. Pre- computed Eukaryotic databases are not yet available, but can be computed using the provided tools.

These tools are primarily Perl-based, and require at least Perl v5.8 and the following Perl modules:

  • Algorithm::Combinatorics qw(variations_with_repetition)
  • Benchmark ':hireswallclock'
  • Bit::Vector
  • Cwd qw(abs_path)
  • Data::Dumper
  • Devel::Size qw(size total_size)
  • Fcntl qw(:flock SEEK_END)
  • feature 'switch'
  • File::Basename
  • File::Copy
  • File::Path
  • Getopt::Long
  • IO::Handle
  • Sort::Key qw(keysort)
  • Sort::Key::Radix qw(rnkeysort ukeysort usort ssort)
  • Statistics::Descriptive
  • Storable qw(store retrieve)
  • threads
  • threads::shared
  • Tie::IxHash
  • Time::HiRes
  • XML::Simple
  • YAML::XS

The splitrim tool is written in D (http://www.dlang.org) and requires an appropriate D compiler (see below).


Local directory structure and sequence files

You will need local copies of NCBI's genome sequences and their associated GenBank files. Ideally, for bacteria, you would transfer the *.fna and *.gbk files of (currently) ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria to your local disk, yielding, say, the destination directory of "/localdir/ftp.ncbi.nih.gov/genbank/ genomes/Bacteria". Since different GI numbers exist for the same genome projects depending on whether you're in the "ftp.ncbi.nih.gov/genbank/genomes" directory or the "ftp.ncbi.nih.gov/genomes" directory, you should also download the *.gbk files of ftp://ftp.ncbi.nih.gov/genomes/Bacteria to local disk. You should then have two directories:

    /localdir/ftp.ncbi.nih.gov/genomes/Bacteria             [*.gbk files only]
    /localdir/ftp.ncbi.nih.gov/genbank/genomes/Bacteria     [*.gbk and *.fna files]

Virus sequences are located in ftp://ftp.ncbi.nih.gov/genomes/Viruses, and you should download the *.fna, *.ffn, and *.gbk files, so that your local directory should now have the following:

    /localdir/ftp.ncbi.nih.gov/genomes/Viruses              [*.gbk,*.fna,*.ffn files]
    /localdir/ftp.ncbi.nih.gov/genomes/Bacteria             [*.gbk files only]
    /localdir/ftp.ncbi.nih.gov/genbank/genomes/Bacteria     [*.gbk and *.fna files]

The included Perl scripts will traverse the directories you specify, looking for the *.fna files, so be sure that a *.gbk GenBank file exists for each that you want to include. This should not be a problem so long as your FTP transfer wasn't prematurely interrupted.

A list of all the GenBank files corresponding to the sequences that will be used for signature generation needs to be created for the next tool. The full path file is listed, one per line, in the list. Two lists should be generated if both "genomes" and "genbank/genomes" GIs are desired in the final result (highly suggested). The following commands should be sufficient:

    $ find /localdir/ftp.ncbi.nih.gov/genomes/Viruses  -name \*.gbk > genomes_viruses.txt
    $ find /localdir/ftp.ncbi.nih.gov/genomes/Bacteria -name \*.gbk > genomes_bacteria.txt
    $ find /localdir/ftp.ncbi.nih.gov/genbank/genomes/Bacteria -name \*.gbk > genbank.txt

Then concatentate the "genomes_viruses.txt" and the "genomes_bacteria.txt" files to generate "genomes.txt" that is used, along with the "genbank.txt" file as input into the next step.


Generating the taxonomic tree and genome vitals references (mkGottchaTaxTree.pl)

The "mkGottchaTaxTree.pl" file will generate TWO required custom-formatted files for the "gottcha.pl" program: the "speciesTreeGI.dmp" and "genomeVitals.dmp" files. Both are Perl Storable-formatted files (see http://search.cpan.org/~ams/Storable/Storable.pm ). In order to run "mkGottchaTaxTree.pl", first you will need to download three NCBI taxonomy dump files from ftp://ftp.ncbi.nih.gov/pub/taxonomy directory. The "taxdump.tar.gz" file will provide the "names.dmp" and "nodes.dmp" files, while the "gi_taxid_nucl.dmp.gz" will provide the "gi_taxid_nucl.dmp" file. You will also need the "genomes.txt" and "genbank.txt" files from the previous step to continue. After decompressing these files, you specify them on the command line as follows:

    $ perl mkGottchaTaxTree.pl --names=/path-to-file/names.dmp                  \
                             --nodes=/path-to-file/nodes.dmp                  \
                             --gi2taxid=/path-to-file/gi_taxid_nucl.dmp       \
                             --genbank=/path-to-file/genbank.txt              \
                             --genomes=/path-to-file/genomes.txt              \
                             --threads=<#cores>                               \

NOTE: Make sure specify the full path to the appropriate files here (do not use "~").


Building the XML input file (mkGottchaXML.pl)

All the sequences used as input for signature generation are organized into a customized XML file created by the "mkGottchaXML.pl" command. Each genome project provided will need to be discoverable in both the "speciesTreeGI.dmp" and "genomeVitals.dmp" files. The full path to each genome project directory is added to a text file. Draft genomes are distinguishable from completed genomes, so if both completed and draft genomes are used, they should be separated into two files, such as "drafts.txt" and "completed.txt". Genomes need to be distinguished by headers describing the kingdom/domain of origin: [PROKARYOTES], [VIRUSES], [EUKARYOTES] or [OTHERS]. Both the "draft.txt" and "completed.txt" files are of the same format.

If you have two genome project directories for the same organism, as is the case of the "genbank/genomes" and "genomes" issue described above, only specify the directories of genome project that contains the (*.fna) sequence files, and NOT both. So if the "genbank/genomes" directory /localdir/ftp.ncbi.nih.gov/genbank/ genomes/Bacteria/Acinetobacter_baumannii_AYE_uid28921 has the relevant sequence files, use this project directory rather than the "genomes" directory /localdir/ ftp.ncbi.nih.gov/genomes/Bacteria/Acinetobacter_baumannii_AYE_uid61637. If both have the sequence files, choose only one!

Note that if you include genome sequencing projects of the same strain -- even if from only the "genbank/genomes" directory -- it is highly likely that both genomes will cancel each other out and there will be NO resulting signatures. Take these two E. coli genome projects for example:

Escherichia_coli_BL21_DE3__uid20713/
Escherichia_coli_BL21_DE3__uid28965/
or
Escherichia_coli_DH1_uid30031/
Escherichia_coli_DH1_uid52077/

The following in a generic example of the "completed.txt" file containing genomic sequences from multiple sources.

[VIRUSES]
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Thielaviopsis_basicola_mitovirus_uid37715
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Thogoto_virus_uid15043
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Thottapalayam_virus_uid29841
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Thrush_coronavirus_HKU12_600_uid32701
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Thunberg_fritillary_virus_uid15483
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Tick_borne_encephalitis_virus_uid1533 5 /localdir/ftp.ncbi.nih.gov/genomes/Viruses/Tiger_puffer_nervous_necrosis_virus_uid41607
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Venezuelan_equine_encephalitis_virus_uid15302
/localdir/ftp.ncbi.nih.gov/genomes/Viruses/Yersinia_pestis_phage_phiA1122_uid14332
[PROKARYOTES]
/localdir/ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_anthracis_Ames_uid309
/localdir/ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_anthracis_CDC_684_uid31329
/localdir/ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_anthracis_CI_uid36309
[EUKARYOTES]
/localdir/ftp.ncbi.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128
/localdir/ftp.ncbi.nih.gov/genomes/Fungi/Cryptococcus_neoformans_var_JEC21_uid10698

NOTE: If you receive an "Unrecognized directory format" error, try removing the trailing forward slashes "/" in these directory names.

If you would like to have completely separate signature databases for each domain -- such as one for Prokaryotes only -- then only populate the [PROKARYOTES] header. The other headers need not be provided if they do not refer to any genome projects. The process will then be repeated for each individual kingdom/domain to generate specific XML files, such as "prokaryotes.xml", "eukaryotes.xml", and "viruses.xml".

The mkGottchaXML.pl file is executed with the following parameters:

    $ perl mkGottchaXML.pl --drafts=/path-to-file/drafts.txt                    \
                         --completed=/path-to-file/completed.txt              \
                         --taxTree=/path-to-file/speciesTreeGI.dmp            \
                         --vitals=/path-to-file/genomeVitals.dmp              \
                         --prefix=<XML filename prefix>                       

The RNA viral genomes will likely not have an associated *.fna file, so this program will automatically attempt to extract this sequence from the associated *.gbk file. If you do not want this to happen, add the "--noDenovoFNA" parameter. If the appropriate GI could not be found, the program dies. To prevent this, use the "--skipOrgNoGI" option and the offending organism will not be included in the XML output file. If the appropriate TAXID could not be found for any particular organism, the program will die. To prevent this, specify the "--continueNoMatch" option.


Running the GOTTCHA database generator (gottcha_db.pl)

This is the resource-intensive tool that, as of Dec 31, 2012 required a 2TB RAM machine (distributed across all 80 available cores) to process all ~2500 prokaryote genome projects. RAM usage can be decreased by reducing the # of cores used, but will also take longer. The GOTTCHA databases used in the submitted publication were created with the following parameters:

    $ perl gottcha_db.pl --taxIdxFile=/path-to-file/PROKARYOTES.xml                \
                    --wordSize=24                                             \
                    --prefixLength=2                                          \
                    --minUniqueFragLength=24                                  \
                    --taxLevel="SPECIES"                                      \
                    --compareSameTaxLevel                                     \
                    --doCompleteTax                                           \
                    --taxTreeFile=/path-to-file/speciesTreeGI.dmp             \
                    --pDir=/path-to-dir1,/path-to-dir2,/path-to-dir3          \
                    --outdir=/path-to-output-dir                              \
                    --dbLabel="PROKARYOTES"                                   \
                    --skipGC                                                  \
                    --nThreads=80                                             \

The wordSize is equivalent to the k-mer length used for finding exact matches. To improve efficiency, the k-mers are broken down into prefixes and suffixes, and a prefixLength of 2 is safe. The minimum length of signatures to report is specified with the "--minUniqueFragLength" parameter. Since this tool is resource- intensive and IO-bound, it is best to split the processing up among as many local drives (having separate controllers) as possible. You can specify as many partition directories with the "--pDir" command, separating each with a comma (no spaces). The "--outdir" must have sufficient space to report all signatures for all levels of taxonomy; ensure this is at least 250GB to be safe. Determining the %GC of the fragments currently slows the processing time down considerably and is HIGHLY recommended to be bypassed with the "--skipGC" command.

Signatures will be reported for each organism at each taxonomic level as individual FASTA files in the appropriate /outdir/fasta/ directory. It is up to you to concatenate them into a single, say, species FASTA file and format them for use with your favorite aligner.

ONCE THE GOTTCHA SIGNATURE DATABASES HAVE BEEN CREATED, METAGENOMIC SEQUENCE READS FROM SOME UNKNOWN SAMPLE CAN BE MAPPED TO THE DATABASE USING YOUR FAVORITE ALIGNER. THE FOLLOWING DESCRIBES THE WORKFLOW BY WHICH THE COMMUNITY COMPOSITION IS DESCRIBED IN TERMS OF ORGANISM IDENTITY AND RELATIVE ABUNDANCE.


Split-trimming the sequence reads (splitrim)

The GOTTCHA workflow currently relies on exact matching of sequence reads to the database of signatures. In order to maximize read recruitment to the signatures, each read in the FASTQ file is split into non-overlapping 30-mers before mapping to the database.

To compile the D sourcecode (http://www.dlang.org), use the following command for DMD:

     $ dmd -O -release -inline splitrim.d

The DMD compiler is freely available at http://dlang.org/dmd-linux.html. Alternatively, the GNU D Compiler (gdc) is available at https://github.com/D-Programming-GDC/GDC, while the LLVM-based D compiler (ldc) project is found at https://github.com/ldc-developer/ldc.

The splitting first occurs on every base whose quality is below the --minQ value specified (default: 20). The remaining fragments are then split into non- overlapping 30-mers as best as possible. If a fragment is between 31-59 bases (less than what it takes to generate two separate 30-bp fragments), it will be retained only if the --recycle option is specified. Although there is a --threads option, best performance is observed between 1-2 threads only.

     $ splitrim --inFile=/path-to-file/INSEQ.fastq --fixL=30 --recycle --ascii=33 \
         --minQ=20 --threads=1 --outPath=/path-to-splitrim-file/TRIM.fastq

Mapping split-trimmed reads to the GOTTCHA database

Mapping the split-trimmed file to the GOTTCHA database is amenable to almost any aligner that takes a FASTQ file as input and produces a SAM file. The typical BWA command (version 0.7.5a-r416) we use is of the form:

     $ bwa mem -k 30 -T 0 -B 100 -O 100 -E 100 -t $THREADS                \
            /path-to-GOTTCHA-db/gottcha                                   \
            /path-to-trimmed-FASTQ-file/TRIM.fastq                        \
          > /path-to-SAM-file/SAMFILE.sam

Profiling the GOTTCHA results (profileGottcha.pl)

You should first create a pre-parsed version of the GOTTCHA FASTA database file for faster subsequent runs, rather than forcing the program to repeatedly parse the flatfile upon every execution. You generate the parsed version of the database with the following command:

     $ perl /path-to-file/profileGottcha.pl                               \
        --db=/path-to-GOTTCHA-FASTA-file/gottcha.fna                      \
        --make_dmp --prefix=MYPREFIX

which will generate the file "MYPREFIX.parsedGOTTCHA.dmp". Using this parsed file will speed up all subsequent runs. The program currently only parses SAM files, which is specified with --sam. Rather than specifying a single SAM file as input, the program expects as input a file containing a list of the SAM files to parse (one file per line and includes the full path). This allows you to include multiple SAM files into a single analysis.

    $ perl /path-to-file/profileGottcha.pl                                \
        --parsedDB=/path-to-file/MYPREFIX.parsedGOTTCHA.dmp               \
        --map=/path-to-list-of-SAM-files/LIST.txt --sam                   \
        --outdir=/path-to-outfiles/OUTDIR --prefix=PROFILE_PREFIX         \
        --trimStats=/path-to-stats-infile/STATS_INFILENAME.txt            \
        --treeFile=/path-to-file/speciesTreeGI.dmp                        \
        --genomeVitals=/path-to-file/genomeVitals.dmp                     \
        --noFastqOut --method=1

The option --noFastqOut suppresses the writing of the (trimmed) reads that mapped into a "MAPPED" file and those that did not map to an "UNMAPPED" output FASTQ file. The --treeFile and --genomeVitals files are those that were used earlier in the pipeline. The --trimStats file is that which is produced by the "splitrim" program above and is used to report a few additional statistics, but is not required.

Primary analysis files produced from this run include:

  • PREFIX.replicon.tsv
  • PREFIX.strain.tsv
  • PREFIX.species.tsv
  • PREFIX.genus.tsv
  • PREFIX.family.tsv
  • PREFIX.order.tsv
  • PREFIX.class.tsv
  • PREFIX.phylum.tsv

Description:

Column Description
RANKNAME (STRAIN, SPECIES...) replicon name (source + plasmid/chr)
NUM_SUBRANKS no. of distinct subranks for the current rank (E.g. the no. of SPECIES under the current GENUS)
GPROJ_ENTRIES no. of genome projects (i.e. STRAINS) under this RANK NAME
LINEAR_LENGTH N/O_LENGTH = non-overlapping length = no. of non-overlapping bases covering the unique DB
UNIQUE_DB_LENGTH no. of unique bases for this organism
FULL_REFDB_LENGTH no. of bases in full reference
LINEAR_COV (LINEAR_LENGTH / UNIQUE_DB_LENGTH)
HIT_COUNT no. of hits recruited to genome
TOTAL_BP_MAPPED sum total of all hit lengths recruited to genome
LINEAR_DOC linear depth-of-coverage = fold coverage of sample's LINEAR_LENGTH (TOTAL_BP_MAPPED / LINEAR_LENGTH)
UREF_DOC unique reference's depth-of-coverage = fold coverage of reference's UNIQUE_DB_LENGTH (TOTAL_BP_MAPPED / UNIQUE_DB_LENGTH)
UREF_CMAX MAX COVERAGE OF REFDB POSSIBLE, GIVEN SAMPLE INPUT BASES (TOTAL_INPUT_BASES / UNIQUE_DB_LENGTH)
FRAC_HITS_POSSIBLE (HIT_COUNT / TOTAL_INPUT_READS)
FRAC_BASES_POSSIBLE (TOTAL_BP_MAPPED / TOTAL_INPUT_BASES)
MEAN_HIT_LENGTH (TOTAL_BP_MAPPED / HIT_COUNT)
MEAN_LINEAR_HIT_LENGTH (LINEAR_LENGTH / HIT_COUNT)
best_[field name] the best value of this field in SUBRANKS

Other non-essential files:

  • PREFIX.giCoords.dmp
  • PREFIX.replicon.contig.coords.csv
  • PREFIX.replicon.contig.HistoByEntry.csv
  • PREFIX.replicon.contig.HistoByEntry.dmp
  • PREFIX.replicon.contig.HistoByGI.csv
  • PREFIX.replicon.contig.HistoByGI.dmp
  • PREFIX.replicon.reads.coords.csv
  • PREFIX.strain.contig.coords.tsv

You will be using the *.tsv primary analysis files. Each result is rolled up to the next highest taxonomy, regardless of the resolution of the GOTTCHA database used at the mapping step. For example, if you mapped your reads to the GENUS-level GOTTCHA database, you will still get the high resolution *.tsv files (replicon, strain, and species) but also the higher taxonomic rollup results (family, order, class, and phylum). It is important to note that the GENUS-level results you get for mapping reads to the GENUS-level GOTTCHA database will be different than the GENUS-level rollup results you get from mapping reads to the STRAIN-level GOTTCHA database. The difference lies in the number of signatures available at the higher taxonomic levels (GENUS > STRAIN), so more reads will be recruited to the GENUS- level database versus the STRAIN-level database, and this difference will be noted in the two different *.genus.tsv files genereated.


Filtering the GOTTCHA profile (filterGottcha.pl)

This script is used to determine the organisms that are present in your read set and does so by filtering out the false positives that may creep into your results. There are 3 stages of filters applied.

The first filter, --minCov, relies on the minimum linear coverage of the signatures present to proceed to the next filter. A comprehensive anlaysis of synthetic metagenomes have determined that a linear coverage between 0.1% - 0.5% provides the optimal range for this first filter.

The second pair of filters, --minMLHL and --cCov, attempt to remove read stacking on signatures that may result from background genomes (eukaryotic, perhaps) or some other novel genome that contain some features of the reference genome signatures. This pair of filters is based on the assumption that the sequencing technology will uniformly (and randomly) generate signatures so that reads should not significantly stack upon signatures, but rather spread out across them if the host organism is truly present. The minimum Mean Linear Hit Length (--minMLHL) is defined as the LINEAR LENGTH (total non-overlapping length of signatures covered) divided by the hit count to that signature length. Stacking will decrease this ratio, which MAY be a sign of a false positive. The other possibility is that the organism is present at high abundance and this ratio will necessarily decrease as more and more of the signatures are covered. Thus, if the minimum MLHL is below that specified by --minMLHL, the associated organism is removed as a false positive unless the linear coverage is above the critical coverage (--cCov).

The third pair of filters is based solely on a minimum number of hits and minimum length of signatures that must be covered in order for an organism to be considered present. These numbers are highly dependent on the amount of sequencing done. A minimum linear length (--minLen) of 100-bp and a minimum number of hits (--minHits) of 10 are conservative values for a single lane of Illumina HiSeq.

    $ perl /path-to-file/filterGottcha.pl                                 \
          --strain=/path-to-file/variantStrainLookup.dmp                  \
          --species=/path-to-file/variantSpeciesLookup.dmp                \
          --genus=/path-to-file/genusLookupBySpecies.dmp                  \
          --family=/path-to-file/familyLookupBySpecies.dmp                \
          --order=/path-to-file/orderLookupBySpecies.dmp                  \
          --class=/path-to-file/classLookupBySpecies.dmp                  \
          --phylum=/path-to-file/phylumLookupBySpecies.dmp                \
          --taxlookup=/path-to-file/taxLookupBySpecies.dmp                \
          --taxLevel=TAXLEVEL                                             \
          --dir=OUTDIR                                                    \
          --prefix=PREFIX                                                 \
          --method=1                                                      \
          --minCov=0.005                                                  \
          --minMLHL=5                                                     \
          --cCov=0.006                                                    \
          --minLen=100                                                    \
          --minHits=10

where TAXLEVEL="species", "genus", etc.

The results produced by this program will list the organism(s) at all taxonomic levels from STRAIN to PHYLUM, their linear length, total bases mapped, linear depth of coverage, and the normalized linear depth of coverage. The linear depth of coverage (DOC) is used to calculate the relative abundances of each organism or taxonomic name in the sample:

    Relative Abundance = 
     (linear DOC of organism X) / [sum of linear DOC of all organisms] of Organism X

The eight Perl Storable (*.dmp) files are produced by a rearrangement of the speciesTreeGI.dmp file used in earlier steps. They are generated with the "makeVariantTaxLookups.pl" script:

    $ perl /path-to-file/makeVariantTaxLookups.pl /path-to-file/speciesTreeGI.dmp