-
Notifications
You must be signed in to change notification settings - Fork 7
1. Produce and annotate metagenomic assembly
If you want to follow along, you can use the data from Albertsen et al. 2013, which is in the folder example_data
. The file names in the following examples use the file names of the example data. (NB: The original header names of the assembly Fasta file were edited to be shorter.)
Use your favorite assemblier, like IDBA-UD or SPAdes, to assemble your metagenome. The assembly should be in a Fasta file. Calculate coverage with bbmap.sh by mapping the original reads used to assemble the metagenome back onto the assembly:
bbmap.sh ref=HPminus_assembly.fasta nodisk in=HPminus_reads.fq.gz covstats=HPminus.coverage
If you already have a SAM file with the mapping, you can use pileup.sh
(also from the BBmap suite of tools) directly:
pileup.sh in=mapping.sam out=coverage.table
BBmap is convenient because it calculates length and GC along with the coverage per scaffold, and outputs them all to the same file. If you use a different mapping tool, you will have to aggregate all these data together in a tab-separated table with the same header names as the bbmap output.
NOTE: Newer versions of BBmap (after 33.x) insert a comment character #
in the header line of the covstats file. Remove this character before running the scripts.
For differential coverage binning, you will need a second read library from a different sample where at least some of the genomes present in the first sample should also be present. Map the second read library onto the same assembly and save the coverage values as a different file:
bbmap.sh ref=HPminus_assembly.fasta nodisk in=HPplus_reads.fq.gz covstats=HPplus.coverage
Use AMPHORA2 or Phyla-AMPHORA to identify conserved marker genes in your assembly, and to assign a taxonomic position. Parse the output of their script Phylotyping.pl (here called phylotype.result
) for import into R:
perl parse_phylotype_result.pl -p phylotype.result > phylotype.result.parsed
Note: The script Phylotyping.pl
in the AMPHORA2 and Phyla-AMPHORA pipelines require RAxML and BioPerl, but there is an issue with older versions of BioPerl. Check your BioPerl version at the command line with
perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
This generates a file called phylotype.result.parsed
which will be imported into R.
An alternative is to Blast contigs directly against a database like NCBI nt and use the NCBI taxon IDs to assign a taxon to each contig. This is the approach used by [Blobology] (https://github.com/blaxterlab/blobology); one of their scripts has been modified to produce an output table compatible with genome-bin-tools
.
They require the NCBI nt Blast database and NCBI taxonomy dump, both of which are available from the NCBI FTP site (see the Blobology manual for instructions). Requires Blast+.
perl blob_annotate_mod.pl --assembly HPminus_assembly.fasta --blastdb /path/to/ncbi/nt --out contig_taxonomy.tab --num_threads 8 --taxdump /path/to/ncbi/taxdump/
The output file contig_taxonomy.tab
can be substituted in the subsequent workflow for phylotype.result.parsed
.
The convenience script provided with gbtools (look in the accessory_scripts folder) uses barrnap to detect SSU rRNA genes in the assembly, and assign phylotype by extacting sequences using fastaFrombed and then using Vsearch against a curated SILVA database.
The database has to be prepared in a specific way that is identical to the Vsearch-indexed database used by phyloFlash. Follow the instructions on the phyloFlash page for downloading and indexing the SILVA database, and point the script below to the location of the SSU database. If your phyloFlash folder is /PATH/TO/phyloFlash
, and the SILVA database version is 119, then the path to the database should be /PATH/TO/phyloFlash/119/SILVA_SSU.noLSU.masked.trimmed.fasta
.
phyloFlash is also a great tool; why not check it out for your metagenome QC needs? (Disclosure: I am also involved in the phyloFlash project).
The rRNA extraction, classification, and output parsing can be done with a wrapper script:
perl get_ssu_for_genome_bin_tools.pl -d <path/to/ssu/database> -c <number_CPUs> -a HPminus_assembly.fasta -o <output_prefix>
This generates a file called <output_prefix>.ssu.tab (in the example data, HPminus.ssu.tab) which will be imported into R.
Use tRNAscan-SE version 1.23 to find tRNA genes.
tRNAscan-SE -G -o HPminus.trna.tab HPminus_assembly.fasta
The output HPminus.trna.tab is directly imported into R.
Check that your input files are free of special characters that could cause problems, and that all the necessary fields are present.
perl input_validator.pl --covstats FILE1,FILE2,FILE3 \\ # Covstats files
--mark MARK1,MARK2 \\ # Marker taxonomy files
--ssu SSU \\ # SSU taxonomy file
--trna TRNA \\ # tRNA gene prediction file
If there are errors, the validator will report the first one in a file, for each file. There must be at least one coverage statistics table, because the validator uses the scaffold IDs in the coverage statistics file to check the scaffold IDs in all the other annotation files, too. Needless to say, you should only validate the files associated with a single metagenome at a time.