add taxonomy data products; attempt for image

emo-bon · Jul 26, 2023 · 46c8c10 · 46c8c10
1 parent 0d261eb
commit 46c8c10
Showing 2 changed files with 110 additions and 6 deletions.
diff --git a/docs/data_products.rst b/docs/data_products.rst
@@ -7,9 +7,9 @@ Description of ``metaGOflow``'s data products
 Quality filtering step
 -----------------------
 
-- ```*.fastq.trimmed.fasta`` **files** 
+- ``*.fastq.trimmed.fasta`` **files** 
 Filtered .fasta files of the forward (R1) and reverse (R2) reads. Its content strongly depends on the 
-``fastp``-related `:doc:/args_and_params` parameters. 
+``fastp``-related :doc:`/args_and_params` parameters. 
 A record in a .fasta file consists of 2 parts: a *header* that always starts with a ``>`` and describes
 the sequence (experiment id, coordinates etc.) and the sequence. 
 Example:
@@ -70,6 +70,109 @@ This file is necessary for running the `mOTUs package <https://github.com/motu-t
 
 
 
+Taxonomy inventory step 
+------------------------
+
+- ``*.merged.motus.tsv`` **file**
+A three-column file with the mOTUs found, their taxonomic assignment and their abundance:
+
+.. code-block:: bash
+
+    #mOTU	consensus_taxonomy	count
+    meta_mOTU_v25_13231	k__Archaea|p__Euryarchaeota|c__Euryarchaeota class incertae sedis|o__Euryarchaeota order incertae sedis|f__Euryarchaeota fam. incertae sedis|g__Euryarchaeota gen. incertae sedis|s__uncultured Candidatus Thalassoarchaea euryarchaeote	12
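
A minimal sketch of how one row of this three-column file could be read, assuming tab-separated columns and ``|``-separated taxonomic levels as in the sample above. The function name ``parse_motus_line`` and the shortened sample row are hypothetical, for illustration only.

```python
def parse_motus_line(line):
    """Split one data row of a *.merged.motus.tsv file into its parts."""
    motu_id, taxonomy, count = line.rstrip("\n").split("\t")
    lineage = {}
    for level in taxonomy.split("|"):
        # Each level looks like "k__Archaea"; the text before "__" is the rank.
        rank, sep, name = level.partition("__")
        if sep:  # ignore tokens that do not follow the rank__name pattern
            lineage[rank] = name
    # Parsed as float in case a profile holds non-integer values (assumption).
    return motu_id, lineage, float(count)


# Shortened, hypothetical row modelled on the sample above.
row = "meta_mOTU_v25_13231\tk__Archaea|p__Euryarchaeota\t12"
print(parse_motus_line(row))
```

Header lines starting with ``#`` would need to be skipped before calling the function.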
+
+
+- ``RNA-counts`` **file**
+
+A file with the LSU and SSU counts of the sample:
+
+.. code-block:: bash 
+
+    LSU count	709
+    SSU count	475
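
As an illustration, the two counts could be loaded into a dictionary as sketched below. The helper name ``parse_rna_counts`` is hypothetical, and the sketch assumes each line ends with an integer separated from its label by whitespace, as in the sample above.

```python
def parse_rna_counts(text):
    """Map each label of an RNA-counts file (e.g. 'LSU count') to its integer value."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last run of whitespace: the label itself contains a space.
        label, value = line.rsplit(None, 1)
        counts[label] = int(value)
    return counts


sample = "LSU count\t709\nSSU count\t475\n"
print(parse_rna_counts(sample))  # {'LSU count': 709, 'SSU count': 475}
```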
+
+
+- ``*.merged_LSU.fasta.mseq.gz`` and ``*.merged_SSU.fasta.mseq.gz`` **files** 
+
+Compressed files with the rRNA sequences used for taxonomic identification, along with their hits and scores. 
+The decompressed files consist of 13 columns with the taxonomy assignment in the last one. 
+
+.. code-block:: bash
+
+    #query	dbhit	bitscore	identity	matches	mismatches	gaps	query_start	query_end	dbhit_start	dbhit_end	strand		SILVA	
+    V1:1:HWLTKDRXY:1:2276:10818:25551-1-merged-143-11-LSU_rRNA_eukarya/q53-152	GEAN01107426.394.3747	98	0.9900000095367432	99	1	0	0	100	2246	2346	+		sk__Eukaryota;k__Metazoa;p__Arthropoda;c__Hexanauplia;o__Calanoida;f__Temoridae;g__Eurytemora;s__Eurytemora_affinis	
+    V1:1:HWLTKDRXY:1:2247:17598:35540-1-merged-151-107-LSU_rRNA_bacteria/q1-253	CP000828.5638205.5641084	163	0.8589743375778198	201	32	1	0	233	26	260	+		sk__Bacteria;k__;p__Cyanobacteria;c__;o__Synechococcales	
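
The compressed files can be read without decompressing them on disk. The sketch below uses only the Python standard library. The function name ``read_mseq`` is hypothetical, and the sketch assumes that columns are tab-separated, that header lines start with ``#``, and that the taxonomy is the last non-empty field of each row, as in the sample above.

```python
import gzip


def read_mseq(path):
    """Yield (query, dbhit, identity, taxonomy) for each hit in a gzipped .mseq file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header/comment lines and blank lines
            fields = line.rstrip("\n").split("\t")
            # Assumption: the taxonomy is the last non-empty field of the row.
            taxonomy = next((f for f in reversed(fields) if f), "")
            # Column order follows the header in the sample: identity is the 4th column.
            yield fields[0], fields[1], float(fields[3]), taxonomy
```

A call such as ``list(read_mseq("sample.merged_LSU.fasta.mseq.gz"))`` (hypothetical file name) would then return one tuple per hit.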
+
+
+
+- ``*.merged_LSU.fasta.mseq.tsv`` and ``*.merged_SSU.fasta.mseq.tsv`` **files**
+
+Abundance tables with 4 columns: the OTU id, its abundance, its taxonomic assignment and, 
+in the last column, the NCBI Taxonomy Id of that assignment. 
+
+
+.. code-block:: bash
+
+    # Constructed from biom file
+    # OTU ID	LSU_rRNA	taxonomy	taxid
+    1039	4.0	sk__Archaea;k__;p__Euryarchaeota;c__Thermoplasmata	183967
+    3616	46.0	sk__Bacteria	2
+    30206	2.0	sk__Bacteria;k__;p__Bacteroidetes;c__Bacteroidia	200643
+    12319	1.0	sk__Bacteria;k__;p__Bacteroidetes;c__Bacteroidia;o__Marinilabiliales;f__Marinifilaceae	1573805
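
To show how the four columns can be used, the sketch below sums the abundance per superkingdom (the first ``sk__`` level of the taxonomy). The function name ``abundance_by_superkingdom`` is hypothetical, and the sketch assumes tab-separated columns and ``;``-separated taxonomic levels as in the sample above.

```python
from collections import defaultdict


def abundance_by_superkingdom(lines):
    """Sum the abundance column of *.mseq.tsv rows per top-level taxon."""
    totals = defaultdict(float)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the two header lines and any blank line
        _otu_id, abundance, taxonomy, _taxid = line.rstrip("\n").split("\t")
        superkingdom = taxonomy.split(";")[0]  # e.g. "sk__Bacteria"
        totals[superkingdom] += float(abundance)
    return dict(totals)


rows = [
    "# Constructed from biom file",
    "# OTU ID\tLSU_rRNA\ttaxonomy\ttaxid",
    "1039\t4.0\tsk__Archaea;k__;p__Euryarchaeota;c__Thermoplasmata\t183967",
    "3616\t46.0\tsk__Bacteria\t2",
    "30206\t2.0\tsk__Bacteria;k__;p__Bacteroidetes;c__Bacteroidia\t200643",
]
print(abundance_by_superkingdom(rows))  # {'sk__Archaea': 4.0, 'sk__Bacteria': 48.0}
```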
+
+
+- ``*.merged_LSU.fasta.mseq.txt`` and ``*.merged_SSU.fasta.mseq.txt`` **files**
+
+Like the ``*.fasta.mseq.tsv`` files but without the header lines, keeping only the abundance and the taxonomy columns and splitting 
+the latter into its taxonomic levels. 
+
+
+.. code-block:: bash 
+
+    4	sk__Archaea	k__	p__Euryarchaeota	c__Thermoplasmata
+    46	sk__Bacteria
+    2	sk__Bacteria	k__	p__Bacteroidetes	c__Bacteroidia
+    1	sk__Bacteria	k__	p__Bacteroidetes	c__Bacteroidia	o__Marinilabiliales	f__Marinifilaceae
+
+These files are used as input to build the Krona plots. 
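
The relation between the ``.tsv`` and the ``.txt`` layouts can be sketched as follows. This is an illustration of the transformation described above, not necessarily how the workflow produces the files. The function names are hypothetical, and abundances are assumed to be whole numbers, as in the samples.

```python
def tsv_row_to_txt(line):
    """Turn one *.mseq.tsv data row into the corresponding *.mseq.txt row."""
    _otu_id, abundance, taxonomy, _taxid = line.rstrip("\n").split("\t")
    # Assumption: abundances are whole numbers ("4.0" becomes "4").
    count = str(int(float(abundance)))
    # One tab-separated field per taxonomic level.
    return "\t".join([count] + taxonomy.split(";"))


def tsv_to_txt(lines):
    """Drop header and blank lines, then convert every remaining row."""
    return [
        tsv_row_to_txt(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


rows = [
    "# Constructed from biom file",
    "# OTU ID\tLSU_rRNA\ttaxonomy\ttaxid",
    "1039\t4.0\tsk__Archaea;k__;p__Euryarchaeota;c__Thermoplasmata\t183967",
    "3616\t46.0\tsk__Bacteria\t2",
]
for out in tsv_to_txt(rows):
    print(out)
```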
+
+
+- ``*.fasta.mseq_json.biom`` **files** 
+
+The output of the MAPseq classification as JSON in the biom format. 
+
+
+
+- ``*.fasta.mseq_hdf5.biom`` **files** 
+
+This biom format is based on HDF5, which provides its overall structure. 
+`HDF5 <https://www.hdfgroup.org>`_ is a widely supported binary format with native parsers available within many programming languages.
+
+
+
+
+- ``krona.html`` **files**
+
+
+A hierarchical visualization of the taxonomic profile, based on the LSU and the SSU respectively. 
+
+
+.. image:: images/krona.png
+   :width: 850
+
+
+
+
+Gene prediction step 
+--------------------
+
+
+Functional annotation step 
+--------------------------
+
+
+Assembly step 
+-------------
 
 
 

diff --git a/docs/usage.rst b/docs/usage.rst
@@ -12,8 +12,7 @@ Raw data
 The sequences file can be provided to ``metaGOflow`` directly or an ENA accession id of the run of interest can be provided and 
 ``metaGOflow`` will fetch the data automatically. 
 
-
-Fill in the ``config.yml`` file and set the parameters as described in the :doc:`/args_and_params`.
+.. attention:: ``metaGOflow`` is not suitable for the analysis of long-read samples.
 
 
 Run ``metaGOflow``
@@ -23,6 +22,8 @@ Assuming ``metaGOflow`` is about to perform in a HPC environment where `Singular
 and that we have built a ``conda`` environment as shown in :doc:`/installation` 
 let's break down how we would execute a run given the ``config.yml`` is set. 
 
+For the ``config.yml`` file and how to set its parameters, see the :doc:`/args_and_params` section.
+
 
 .. code-block:: bash
 
@@ -117,11 +118,11 @@ In the same place, the output of the assembly step (``final.contigs.fa``) will b
    * - ``*.merged.fasta``
      - Merged filtered sequences 
    * - ``*.merged.motus.tsv``
-     - Merged sequences MOTUs
+     - mOTUs along with their taxonomic assignment and their abundance
    * - ``*.merged.qc_summary``
      - Quality control (QC) summary of the merged sequences
    * - ``*.merged.unfiltered_fasta`` 
-     - Merged sequences that did not pass the filtering
+     - Merged sequences with clean headers
    * - ``fastp.html``
      - FASTP analysis of raw sequence data
    * - ``final.contigs.fa``