An "OGU" (operational genomic unit) is the minimal unit of analysis in community ecology studies based on shotgun metagenomic or other forms of whole-genome microbiome data. OGUs are simply the reference genomes to which input sequences are aligned; there is no need to assign taxonomy to them. This contrasts with conventional practice, in which analyses are based on taxonomic units such as genera or species. In this sense, an OGU is analogous to an ASV in 16S rRNA studies.
The advantages of using OGUs include:
- Highest possible resolution.
- Independence from taxonomy, which is coarse and can be error-prone as a classification system.
- Support for phylogeny-aware analyses such as Faith's PD and UniFrac. This is enhanced by our "Web of Life" (WoL) reference phylogeny, or any similar work.
The OGU analysis was explained, benchmarked and discussed in:
- Zhu Q, Huang S, Gonzalez A, McGrath I, McDonald M, Haiminen N, Armstrong G, et al. Phylogeny-aware analysis of metagenome community ecology based on matched reference genomes while bypassing taxonomy. mSystems. 2022. e00167-22.
To generate an OGU table, one needs a multiplexed alignment file, or a directory of per-sample alignment files. These files can be generated by aligning sequencing data against a reference genome database. For example:
mkdir -p bt2out
while read -r id; do
  bowtie2 --very-sensitive -p 8 -x "$db" -1 "$id.R1.fq" -2 "$id.R2.fq" -S "bt2out/$id.sam"
done < sample.list
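The loop reads sample IDs from sample.list, one per line. A minimal sketch for building that list, assuming (this is not stated above) that the forward-read files sit in one directory and are named `<id>.R1.fq`; the function name `make_sample_list` is hypothetical:

```shell
# Hypothetical helper: print one sample ID per line for every <id>.R1.fq
# file found in the given directory (defaults to the current directory).
make_sample_list() {
  dir="${1:-.}"
  for f in "$dir"/*.R1.fq; do
    [ -e "$f" ] || continue        # skip the unexpanded pattern if nothing matches
    basename "$f" .R1.fq           # strip directory and suffix, leaving the ID
  done
}

# Usage (commented out so that sourcing this block has no side effects):
# make_sample_list . > sample.list
```

If your files follow a different naming scheme (for example `_R1_001.fastq.gz`), the glob pattern and suffix need to be adjusted accordingly.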
Here are details about using our WoL database and using Bowtie2 for alignment.
Then one can run Woltka to convert the alignment file(s) into an OGU table:
woltka classify -i bt2out -o table.biom
The output file table.biom is a BIOM table with rows as genome IDs (OGUs), columns as sample IDs, and cell values as counts of OGUs in individual samples.
If needed, you may convert a BIOM table into a tab-delimited file:
biom convert --to-tsv -i table.biom -o table.tsv
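A quick sanity check on the converted table is to total the counts in each sample, for instance to judge which sampling depth is reasonable in downstream diversity analyses. The sketch below assumes the layout that `biom convert --to-tsv` normally writes (a comment line, then a header line starting with `#OTU ID`, then one row per OGU); the function name `sample_totals` is hypothetical:

```shell
# Hypothetical helper: print "<sample ID><TAB><total count>" for each
# sample column of a tab-delimited OGU table (e.g. table.tsv).
sample_totals() {
  awk -F'\t' '
    /^#OTU ID/ {                      # header row: remember the sample IDs
      for (i = 2; i <= NF; i++) name[i] = $i
      n = NF
      next
    }
    /^#/ { next }                     # skip any other comment line
    { for (i = 2; i <= NF; i++) total[i] += $i }
    END { for (i = 2; i <= n; i++) printf "%s\t%s\n", name[i], total[i] + 0 }
  ' "$1"
}

# Usage:
# sample_totals table.tsv
```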
Note: Qiita implements the WoL database and a Woltka workflow, which performs sequence alignment using the SHOGUN protocol. If you are a Qiita user, the alignment file as well as the OGU table (called a "per-genome" table) can be automatically generated and downloaded from the Qiita interface. See details.
The generated BIOM table can be imported into a QIIME 2 artifact:
qiime tools import --type 'FeatureTable[Frequency]' --input-path table.biom --output-path table.qza
These intermediate steps are automated if you use the QIIME 2 plugin of Woltka.
One can then investigate the microbiome by applying classical QIIME analyses on the OGU table. For example, with the WoL reference phylogeny (direct download link: tree.qza), one can do:
qiime diversity core-metrics-phylogenetic \
--i-phylogeny tree.qza \
--i-table table.qza \
--p-sampling-depth 10000 \
--m-metadata-file metadata.tsv \
--output-dir core-metrics-results
Most (if not all) QIIME 2 analyses designed for 16S rRNA data (ASV or OTU) also apply to OGUs. Please refer to the QIIME 2 documentation for tutorials and references.
It is quite common that one query sequence can be aligned to multiple reference genomes. Bowtie2 by default reports one hit per query. The SHOGUN protocol reports up to 16 hits. Other programs and protocols have their own ways of dealing with multiple hits.
In such cases, Woltka by default counts each OGU as 1 / k, where k is the total number of matching genomes.
Alternatively, one may choose to discard all non-unique matches, by adding a flag:
woltka classify --uniq -i input_dir -o output.biom
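The two counting modes can be illustrated on a toy input. The sketch below is not Woltka's implementation; it assumes a hypothetical two-column, tab-delimited file with one line per alignment (read ID, genome ID), in which each read/genome pair appears once. `count_fractional` gives every matched genome 1 / k of a read, while `count_unique` keeps only reads that match exactly one genome. Both function names are made up for illustration.

```shell
# Toy illustration only (function names and input layout are hypothetical).
# Input file: <read ID><TAB><genome ID>, one line per alignment.

# Each read contributes 1/k to each of the k genomes it matches.
count_fractional() {
  awk -F'\t' '
    NR == FNR { k[$1]++; next }       # first pass: number of hits per read
    { count[$2] += 1 / k[$1] }        # second pass: add the share of this read
    END { for (g in count) printf "%s\t%.2f\n", g, count[g] }
  ' "$1" "$1" | sort
}

# Reads that match more than one genome are discarded.
count_unique() {
  awk -F'\t' '
    NR == FNR { k[$1]++; next }       # first pass: number of hits per read
    k[$1] == 1 { count[$2]++ }        # second pass: keep single-hit reads only
    END { for (g in count) printf "%s\t%d\n", g, count[g] }
  ' "$1" "$1" | sort
}
```

For three reads r1 (matching G1 and G2), r2 (matching G1) and r3 (matching G1, G2, G3 and G4), the fractional counts are G1 = 1.75, G2 = 0.75, G3 = 0.25 and G4 = 0.25, which sum to the three input reads. The unique mode keeps only r2 and reports G1 = 1.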
Technically, one can use any sequence aligner and reference genome database to generate alignment files, which can then be converted into an OGU table. We cannot vouch for the quality of the outcome, but we understand that you may want to do this for consistency with existing parts of your analytical pipeline. For example:
bwa mem refseq.fna input.R1.fq input.R2.fq > output.sam
blastn -db refseq_genomes -query input.fa -max_target_seqs 16 -outfmt 6 -out output.txt
In many reference genome databases, subject sequences are individual nucleotide sequences (e.g., chromosomes or scaffolds) instead of whole genomes. In order to produce OGUs, one needs to supply Woltka with a sequence-to-genome mapping file (nucl2g.txt, example provided under taxonomy/nucl):
woltka classify -m nucl2g.txt -i input_dir -o output.biom
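Conceptually, the mapping file lets hits to separate sequences of one genome be pooled under that genome. A toy sketch of that lookup follows. It assumes a two-column, tab-delimited mapping (sequence ID, genome ID) and a hypothetical hit list with one subject sequence ID per line; the function name is made up, and this is not how Woltka itself is implemented:

```shell
# Hypothetical helper: count hits per genome, given
#   $1 = mapping file: <sequence ID><TAB><genome ID>
#   $2 = hit list: one subject sequence ID per line
hits_per_genome() {
  awk -F'\t' '
    NR == FNR { genome[$1] = $2; next }     # load the sequence-to-genome map
    ($1 in genome) { count[genome[$1]]++ }  # pool hits under the genome ID
    END { for (g in count) printf "%s\t%d\n", g, count[g] }
  ' "$1" "$2" | sort
}

# Usage:
# hits_per_genome nucl2g.txt subject_ids.txt
```

Sequence IDs that are absent from the mapping are silently skipped in this sketch; in a real analysis you would probably want to report them.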
In some scenarios you may have a profile of genes, or ORFs (open reading frames), denoted as e.g. "G000123456_789" (meaning the 789th ORF of genome G000123456). This can be generated using Woltka's "coord-match" functional classification (see details), or through sequence alignment against genes (instead of genomes) (see details). You can extract genome IDs (i.e., OGUs) from ORF IDs using the collapse command:
woltka collapse -i orf.biom -e -f 1 -o ogu.biom
- In this command, -e means that feature IDs are nested (as in the current case); -f 1 extracts the first field of the feature IDs.
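What this collapse does to feature IDs can be mimicked on a toy table. The sketch assumes a hypothetical two-column, tab-delimited file (ORF ID, count) with IDs of the form `<genome>_<index>`; it is an illustration of the idea, not a substitute for `woltka collapse`, and the function name is made up:

```shell
# Hypothetical helper: sum ORF counts per genome by keeping the part of
# each ORF ID before the first underscore (i.e. the first field).
orfs_to_ogus() {
  awk -F'\t' '
    {
      split($1, part, "_")            # "G000123456_789" -> part[1] = "G000123456"
      count[part[1]] += $2
    }
    END { for (g in count) printf "%s\t%d\n", g, count[g] }
  ' "$1" | sort
}

# Usage:
# orfs_to_ogus orf_counts.tsv
```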
Note, however, that the resulting OGU table is not identical to the one generated by the standard method. This is because gene-based methods inevitably miss intergenic regions, which are usually minor in microbial genomes but can still make a difference.
When should you use this approach? For certain data types, such as metatranscriptomic data, which are naturally mRNA-derived, the gene-based approach may be more justified (although you still miss unannotated genes and non-coding transcripts). You will need to decide based on the goals of your study.
Another, more common scenario is when you want to perform a structural/functional stratification analysis. You start with ORFs, and care about both what these genes do and which source organisms they come from. See details about stratification.