add taxonomy data products; attempt for image
hariszaf committed Jul 26, 2023
1 parent 0d261eb commit 46c8c10
Showing 2 changed files with 110 additions and 6 deletions.
107 changes: 105 additions & 2 deletions docs/data_products.rst
@@ -7,9 +7,9 @@ Description of ``metaGOflow``'s data products
Quality filtering step
-----------------------

- ```*.fastq.trimmed.fasta`` **files**
- ``*.fastq.trimmed.fasta`` **files**
Filtered .fasta files of the forward (R1) and reverse (R2) reads. Its content strongly depends on the
``fastp``-related `:doc:/args_and_params` parameters.
``fastp``-related :doc:`/args_and_params` parameters.
A record in a .fasta file consists of 2 parts: a *header* that always starts with a ``>`` and describes
the sequence (experiment id, coordinates etc.) and the sequence itself.
Example:
@@ -70,6 +70,109 @@ This file is necessary for running the `mOTUs package <https://github.com/motu-t



Taxonomy inventory step
------------------------

- ``*.merged.motus.tsv`` **file**
A three column file with the mOTUs found, their taxonomic assignment and their abundance:

.. code-block:: bash

   #mOTU consensus_taxonomy count
   meta_mOTU_v25_13231 k__Archaea|p__Euryarchaeota|c__Euryarchaeota class incertae sedis|o__Euryarchaeota order incertae sedis|f__Euryarchaeota fam. incertae sedis|g__Euryarchaeota gen. incertae sedis|s__uncultured Candidatus Thalassoarchaea euryarchaeote 12

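As a rough illustration of how such a row could be read, the sketch below splits one record into its three fields and breaks the taxonomy string into ranks. It assumes the columns are tab-separated and that ranks are joined with ``|`` and prefixed with ``k__``, ``p__`` and so on, as in the example above; the function name ``parse_motus_line`` is hypothetical and the sample row is an abbreviated version of the one shown.

```python
def parse_motus_line(line):
    """Split one mOTUs row into (mOTU id, {rank prefix: name}, count).

    Assumption: fields are tab-separated and the count is an integer.
    """
    motu_id, taxonomy, count = line.rstrip("\n").split("\t")
    ranks = {}
    for field in taxonomy.split("|"):
        # "p__Euryarchaeota" -> prefix "p", name "Euryarchaeota"
        prefix, _, name = field.partition("__")
        ranks[prefix] = name
    return motu_id, ranks, int(count)


# Abbreviated version of the example row (only three ranks kept).
row = (
    "meta_mOTU_v25_13231\t"
    "k__Archaea|p__Euryarchaeota|s__uncultured Candidatus Thalassoarchaea euryarchaeote\t"
    "12"
)
motu_id, ranks, count = parse_motus_line(row)
print(motu_id, ranks["p"], count)
```

Keeping the ranks in a dictionary keyed by prefix makes it easy to aggregate counts at a chosen level, for example per phylum.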
- ``RNA-counts`` **file**

A file with the LSU and SSU counts for the sample:

.. code-block:: bash

   LSU count 709
   SSU count 475

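A minimal sketch of reading these two values into a dictionary follows. It assumes each line has exactly the three whitespace-separated tokens shown above (label, the word ``count``, and an integer); ``parse_rna_counts`` is a hypothetical helper, not part of ``metaGOflow``.

```python
def parse_rna_counts(text):
    """Return {"LSU": n, "SSU": m} from the contents of an RNA-counts file."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect lines of the form: "<label> count <integer>"
        if len(parts) == 3 and parts[1] == "count":
            counts[parts[0]] = int(parts[2])
    return counts


example = "LSU count 709\nSSU count 475\n"
counts = parse_rna_counts(example)
print(counts)
```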
- ``*.merged_LSU.fasta.mseq.gz`` and ``*.merged_SSU.fasta.mseq.gz`` **files**

Compressed files with the rRNA sequences used for taxonomic identification along with their hits and scores.
The decompressed files consist of 13 columns, with the taxonomy assignment in the last one.

.. code-block:: bash

   #query dbhit bitscore identity matches mismatches gaps query_start query_end dbhit_start dbhit_end strand SILVA
   V1:1:HWLTKDRXY:1:2276:10818:25551-1-merged-143-11-LSU_rRNA_eukarya/q53-152 GEAN01107426.394.3747 98 0.9900000095367432 99 1 0 0 100 2246 2346 + sk__Eukaryota;k__Metazoa;p__Arthropoda;c__Hexanauplia;o__Calanoida;f__Temoridae;g__Eurytemora;s__Eurytemora_affinis
   V1:1:HWLTKDRXY:1:2247:17598:35540-1-merged-151-107-LSU_rRNA_bacteria/q1-253 CP000828.5638205.5641084 163 0.8589743375778198 201 32 1 0 233 26 260 + sk__Bacteria;k__;p__Cyanobacteria;c__;o__Synechococcales

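The 13-column layout can be mapped onto named fields as sketched below. This is only an illustration: it assumes the columns are tab-separated, and the key ``taxonomy`` is a label chosen here for the last column (headed ``SILVA`` in the file). ``parse_mseq_line`` is a hypothetical helper.

```python
MSEQ_COLUMNS = [
    "query", "dbhit", "bitscore", "identity", "matches", "mismatches", "gaps",
    "query_start", "query_end", "dbhit_start", "dbhit_end", "strand",
    "taxonomy",  # assumption: our own name for the last ("SILVA") column
]


def parse_mseq_line(line):
    """Turn one data row of a decompressed .mseq file into a dictionary."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(MSEQ_COLUMNS, values))
    record["identity"] = float(record["identity"])
    # Lineage levels are separated by ";" (empty levels look like "k__").
    record["lineage"] = record["taxonomy"].split(";")
    return record


row = "\t".join([
    "V1:1:HWLTKDRXY:1:2247:17598:35540-1-merged-151-107-LSU_rRNA_bacteria/q1-253",
    "CP000828.5638205.5641084", "163", "0.8589743375778198", "201", "32", "1",
    "0", "233", "26", "260", "+",
    "sk__Bacteria;k__;p__Cyanobacteria;c__;o__Synechococcales",
])
record = parse_mseq_line(row)
print(record["dbhit"], record["lineage"][-1])
```

To read a compressed file directly, the rows could be obtained with ``gzip.open(path, "rt")`` from the Python standard library, skipping lines that start with ``#``.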
- ``*.merged_LSU.fasta.mseq.tsv`` and ``*.merged_SSU.fasta.mseq.tsv`` **files**

Abundance tables consisting of 4 columns: the OTU id, its abundance, its taxonomic assignment
and, in the last column, the NCBI Taxonomy Id of that assignment.


.. code-block:: bash

   # Constructed from biom file
   # OTU ID LSU_rRNA taxonomy taxid
   1039 4.0 sk__Archaea;k__;p__Euryarchaeota;c__Thermoplasmata 183967
   3616 46.0 sk__Bacteria 2
   30206 2.0 sk__Bacteria;k__;p__Bacteroidetes;c__Bacteroidia 200643
   12319 1.0 sk__Bacteria;k__;p__Bacteroidetes;c__Bacteroidia;o__Marinilabiliales;f__Marinifilaceae 1573805

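One possible way to load such a table is sketched here, assuming tab-separated columns and ``#``-prefixed header lines as in the example above. The function name ``read_abundance_table`` and the dictionary keys are illustrative choices, not names used by ``metaGOflow``.

```python
def read_abundance_table(lines):
    """Parse the rows of a .mseq.tsv file, skipping '#' header lines."""
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        otu_id, abundance, taxonomy, taxid = line.rstrip("\n").split("\t")
        rows.append({
            "otu_id": otu_id,
            "abundance": float(abundance),
            "taxonomy": taxonomy,
            "taxid": int(taxid),
        })
    return rows


example = [
    "# Constructed from biom file\n",
    "# OTU ID\tLSU_rRNA\ttaxonomy\ttaxid\n",
    "1039\t4.0\tsk__Archaea;k__;p__Euryarchaeota;c__Thermoplasmata\t183967\n",
    "3616\t46.0\tsk__Bacteria\t2\n",
]
table = read_abundance_table(example)
total = sum(row["abundance"] for row in table)
print(len(table), total)
```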
- ``*.merged_LSU.fasta.mseq.txt`` and ``*.merged_SSU.fasta.mseq.txt`` **files**

Like the ``*.fasta.mseq.tsv`` files but without the header lines, keeping only the abundance and the taxonomy columns and splitting
the latter into its taxonomic levels.


.. code-block:: bash

   4 sk__Archaea k__ p__Euryarchaeota c__Thermoplasmata
   46 sk__Bacteria
   2 sk__Bacteria k__ p__Bacteroidetes c__Bacteroidia
   1 sk__Bacteria k__ p__Bacteroidetes c__Bacteroidia o__Marinilabiliales f__Marinifilaceae

These files are used as input to build the Krona plots.
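The relation between a ``.mseq.tsv`` row and the corresponding ``.mseq.txt`` row can be sketched as follows. This is not the pipeline's own conversion code; it is a hypothetical helper (``tsv_row_to_krona_row``) that assumes tab-separated fields in both files and integer-valued abundances.

```python
def tsv_row_to_krona_row(line):
    """Keep the abundance and split the taxonomy into one field per level."""
    _otu_id, abundance, taxonomy, _taxid = line.rstrip("\n").split("\t")
    # "4.0" -> "4" (assumption: abundances are whole numbers)
    count = str(int(float(abundance)))
    return "\t".join([count] + taxonomy.split(";"))


row = "1039\t4.0\tsk__Archaea;k__;p__Euryarchaeota;c__Thermoplasmata\t183967"
converted = tsv_row_to_krona_row(row)
print(converted)
```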


- ``*.fasta.mseq_json.biom`` **files**

The output of the MAPseq classification as JSON in the BIOM format.



- ``*.fasta.mseq_json.biom`` **files**

The BIOM format relies on HDF5 for its overall structure.
`HDF5 <https://www.hdfgroup.org>`_ is a widely supported binary format with native parsers available in many programming languages.




- ``krona.html`` **files**


A hierarchical visualization of the taxonomic profile, based on the LSU and the SSU respectively.


.. image:: images/krona.png
   :width: 850




Gene prediction step
--------------------


Functional annotation step
--------------------------


Assembly step
-------------



9 changes: 5 additions & 4 deletions docs/usage.rst
@@ -12,8 +12,7 @@ Raw data
The sequence files can be provided to ``metaGOflow`` directly, or an ENA accession id of the run of interest can be provided and
``metaGOflow`` will fetch the data automatically.


Fill in the ``config.yml`` file and set the parameters as described in the :doc:`/args_and_params`.
.. attention:: ``metaGOflow`` is not suitable for the analysis of long-read samples.


Run ``metaGOflow``
@@ -23,6 +22,8 @@ Assuming ``metaGOflow`` is about to perform in a HPC environment where `Singular
and that we have built a ``conda`` environment as shown in :doc:`/installation`
let's break down how we would execute a run given the ``config.yml`` is set.

For the ``config.yml`` file and how to set its parameters, see the :doc:`/args_and_params` section.


.. code-block:: bash
@@ -117,11 +118,11 @@ In the same place, the output of the assembly step (``final.contigs.fa``) will b
* - ``*.merged.fasta``
- Merged filtered sequences
* - ``*.merged.motus.tsv``
- Merged sequences MOTUs
- mOTUs along with their taxonomic assignment and their abundance
* - ``*.merged.qc_summary``
- Quality control (QC) summary of the merged sequences
* - ``*.merged.unfiltered_fasta``
- Merged sequences that did not pass the filtering
- Merged sequences with clean headers
* - ``fastp.html``
- FASTP analysis of raw sequence data
* - ``final.contigs.fa``