The CAMI2 databases were downloaded from https://data.cami-challenge.org/participate
Note that the marine and strain madness datasets use the same CAMI NCBI databases.
Please use the following CAMI 2 challenge databases for the profiling and taxonomic binning challenge (also listed on the Databases tab):
- Blast nr:
https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_blast/nr.gz
- Blast nt:
https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_blast/nt.gz
- NCBI Taxonomy:
https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_taxonomy.tar
- Accession to Taxid Mapping:
https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES/ncbi_taxonomy_accession2taxid.tar
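The downloads listed above can be scripted. What follows is a minimal sketch rather than part of the CAMI instructions: it assumes curl is installed and reuses the REPO and CAMI directory variables from the commands in this document. The function name fetch_cami_databases and the DRY_RUN switch are hypothetical. Because the archives are large, it only prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch: fetch the four CAMI 2 database files listed above.
# fetch_cami_databases and DRY_RUN are illustrative names, not CAMI tooling.
REPO="${REPO:-$HOME/metaBenchmarks}"
CAMI="${CAMI:-${REPO}/data/databases/cami}"
BASE_URL="https://openstack.cebitec.uni-bielefeld.de:8080/swift/v1/CAMI_2_DATABASES"

fetch_cami_databases() {
  for path in \
    ncbi_blast/nr.gz \
    ncbi_blast/nt.gz \
    ncbi_taxonomy.tar \
    ncbi_taxonomy_accession2taxid.tar
  do
    dest="${CAMI}/$(basename "${path}")"
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: only show what would be run.
      echo "curl -L -C - -o ${dest} ${BASE_URL}/${path}"
    else
      mkdir -p "${CAMI}"
      # -L follows redirects; -C - resumes a partially downloaded file.
      curl -L -C - -o "${dest}" "${BASE_URL}/${path}"
    fi
  done
}

fetch_cami_databases
```

Run with DRY_RUN=0 to actually download. The two .tar archives still need unpacking (tar -xf); their internal layout is not described here, so check where nodes.dmp, names.dmp and prot.accession2taxid.gz end up before pointing diamond makedb at them.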
REPO="$HOME/metaBenchmarks"
CAMI="${REPO}/data/databases/cami"
mkdir -p "${CAMI}/diamond"
diamond makedb \
--in "${CAMI}/nr.gz" \
--db "${CAMI}/diamond/nr_with_taxonomy" \
--taxonmap "${CAMI}/prot.accession2taxid.gz" \
--taxonnodes "${CAMI}/nodes.dmp" \
--taxonnames "${CAMI}/names.dmp"
Example output from the diamond makedb command:
Database sequences 184124794
Database letters 67182734252
Accessions in database 612278963
Entries in accession to taxid file 628458607
Database accessions mapped to taxid 611827407
Database sequences mapped to taxid 184018689
Database hash 26d9328ca8816352bccb879e4a97fd77
Total time 5354s
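As a quick sanity check on output like the above, the share of sequences that received a taxid can be computed from the two "Database sequences" lines. This is a minimal sketch, assuming the makedb output was saved to a file and keeps the line labels shown above. The function name summarize_taxid_mapping and the file name makedb.log are hypothetical.

```shell
# Sketch: report the fraction of database sequences mapped to a taxid.
# summarize_taxid_mapping is an illustrative helper, not a diamond command.
summarize_taxid_mapping() {
  awk '
    /^Database sequences mapped to taxid/ { mapped = $NF; next }
    /^Database sequences/                 { total = $NF; next }
    END {
      if (total > 0) {
        printf "%.2f%% of sequences mapped to a taxid\n", 100 * mapped / total
      } else {
        print "no sequence counts found" > "/dev/stderr"
        exit 1
      }
    }'
}

# Values copied from the example output above.
summarize_taxid_mapping <<'EOF'
Database sequences 184124794
Database sequences mapped to taxid 184018689
EOF
# prints: 99.94% of sequences mapped to a taxid
```

With a saved log, the same helper can be called as summarize_taxid_mapping < makedb.log.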
For parameter sweep benchmarking of all Autometa entrypoints, the NCBI, GTDB and markers databases are required.
For more information on setting up the necessary Autometa databases (NCBI, GTDB and single-copy markers), see the Autometa databases documentation.
REPO="$HOME/metaBenchmarks"
TOOL="mmseqs2"
TOOL_DBDIR="${REPO}/data/databases/${TOOL}"
mkdir -p "${TOOL_DBDIR}"
cd "${TOOL_DBDIR}"
mmseqs databases NR mmseqs2_NR NR_tmp --threads 20
REPO="$HOME/metaBenchmarks"
TOOL="kraken2"
TOOL_DBDIR="${REPO}/data/databases/${TOOL}"
mkdir -p "${TOOL_DBDIR}"
cd "${TOOL_DBDIR}"
kraken2-build --standard --threads 20 --db kraken2_db
You may need to download the required NCBI files before formatting the diamond database if you have not already done so. If you have already downloaded these files and they are in sync with the other tools' formatted databases, you may simply specify their locations.
NOTE: Constructing the diamond-formatted database with the taxon mapping parameters is required for taxon-binning benchmarking, i.e. it is required when using the --outfmt 102 parameter, e.g. diamond blastx --outfmt 102 ...
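To illustrate the note above, here is a minimal sketch of a taxon-assignment run followed by a tally of classified queries. It assumes the database built with the taxon mapping parameters is named nr_with_taxonomy, and that --outfmt 102 writes tab-separated query ID, taxid (0 when unassigned) and e-value, as described in the DIAMOND documentation. The file names contigs.fna and contigs.lca.tsv and both function names are hypothetical.

```shell
# Sketch only: function and file names are illustrative.
# Requires a database built with --taxonmap/--taxonnodes/--taxonnames.
assign_taxa() {
  # $1: query FASTA, $2: diamond database, $3: output TSV
  diamond blastx --query "$1" --db "$2" --outfmt 102 --out "$3"
}

# Count queries with and without a taxid in --outfmt 102 output
# (assumed columns: query ID, taxid, e-value; taxid 0 = unclassified).
count_classified() {
  awk -F '\t' '
    $2 == 0 { unclassified++; next }
            { classified++ }
    END { printf "classified=%d unclassified=%d\n", classified, unclassified }
  ' "$1"
}

# Example usage:
#   assign_taxa contigs.fna nr_with_taxonomy contigs.lca.tsv
#   count_classified contigs.lca.tsv
```

Running the same diamond command against a database built without the taxonomy options is expected to fail, which is why the makedb step needs them.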
You may retrieve the respective files from here:
- --in: Non-redundant (nr) database, ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
- --taxonmap: file prot.accession2taxid.gz
- --taxonnodes and --taxonnames: files nodes.dmp and names.dmp, found within taxdump.tar.gz (which contains nodes.dmp, names.dmp, merged.dmp and delnodes.dmp)
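Before running diamond makedb it can help to confirm that the four inputs are in place. This is a minimal sketch, assuming all files sit in one directory under the names used in this document and that nodes.dmp and names.dmp have already been extracted from taxdump.tar.gz (for example with tar -xzf taxdump.tar.gz nodes.dmp names.dmp). The function name check_diamond_inputs is hypothetical.

```shell
# Sketch: verify the inputs for diamond makedb exist and are non-empty.
# check_diamond_inputs is an illustrative helper, not part of diamond.
check_diamond_inputs() {
  dir="${1:-.}"
  missing=0
  for f in nr.gz prot.accession2taxid.gz nodes.dmp names.dmp; do
    if [ ! -s "${dir}/${f}" ]; then
      echo "missing or empty: ${dir}/${f}" >&2
      missing=1
    fi
  done
  # Exit status 0 only when every input was found.
  return "${missing}"
}

# Example usage:
#   check_diamond_inputs "${TOOL_DBDIR}" && echo "inputs ready"
```

The check only looks at presence and size; it does not confirm that the files are in sync with each other or with the other tools' databases.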
REPO="$HOME/metaBenchmarks"
TOOL="diamond"
TOOL_DBDIR="${REPO}/data/databases/${TOOL}"
mkdir -p "${TOOL_DBDIR}"
cd "${TOOL_DBDIR}"
diamond makedb \
--in "nr.gz" \
--db "nr_with_taxonomy" \
--taxonmap "prot.accession2taxid.gz" \
--taxonnodes "nodes.dmp" \
--taxonnames "names.dmp"