From c36a168d59d93619b488aec4770391e94946dc6e Mon Sep 17 00:00:00 2001
From: Haris Zafeiropoulos <haris.zafeiropoulos@kuleuven.be>
Date: Wed, 26 Jul 2023 17:37:49 +0200
Subject: [PATCH] add last functional annotation related files

---
 docs/data_products.rst | 107 +++++++++++++++++++++++++++++++++++++++++
 docs/usage.rst         |   4 +-
 2 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/docs/data_products.rst b/docs/data_products.rst
index 46e61e60..4c8ff442 100644
--- a/docs/data_products.rst
+++ b/docs/data_products.rst
@@ -162,18 +162,125 @@ A hierarchical visual component of the taxonomic profile based on the LSU and th
 
 
 
+- **Files** under the ``sequence-categorisation`` folder
+
+A set of compressed .fasta files (:ref:`usage/sequence-categorisation`), all of the same kind, is returned under the ``sequence-categorisation`` folder.
+Each file consists of the filtered and merged reads of the sample that are related to a specific `RNA family <https://rfam.org>`_.
+
+For example, the ``tmRNA.RF00023.fasta.gz`` file includes reads that are related to the
+transfer-messenger RNA (`RF00023 <https://rfam.org/family/RF00023>`_).
+
+
+
+
 
 Gene prediction step 
 --------------------
 
+- ``*.merged_CDS.ffn`` **file**
+
+Nucleotide coding sequences in .fasta format
+that correspond to the coding genes returned by `FragGeneScan <https://pubmed.ncbi.nlm.nih.gov/20805240/>`_.
+
+.. code-block:: bash
+
+    >SRR1620013.54-C038EACXX:5:1101:02684:02629-1-merged-101-1_3_101_-
+    GACAAGATCGACCGCATCATCGAGTTGTGCATCGCGCTGGAAGCGGACTTTGTTGAGCTCGCGACGTGCCAGTTCTACGGCTGGGCGCAGCTCAATCGT
+
+
+- ``*.merged_CDS.faa`` **file**
+
+Amino acid coding sequences that correspond to the coding genes in the ``*.merged_CDS.ffn`` file.
+
+.. code-block:: bash
+
+    >SRR1620013.54-C038EACXX:5:1101:02684:02629-1-merged-101-1_3_101_-
+    DKIDRIIELCIALEADFVELATCQFYGWAQLNR
+
+
 
 Functional annotation step 
 --------------------------
 
 
+
+- ``*.merged_CDS.I5.tsv.gz`` **file**
+
+Main output of the InterPro annotation.
+A compressed tab-separated file consisting of 15 columns.
+The ``protein_accession`` column holds the id under which the protein can be found in the sample's reads.
+The ``analysis`` column states which of the InterProScan analyses the entry refers to
+(e.g., Pfam, TIGRFAM, ProSitePatterns, ProSiteProfiles).
+The ``go`` column gives the corresponding Gene Ontology term,
+while the last column (``pathways_annotations``) lists annotations linked to the original annotation, from resources such as MetaCyc, Reactome, etc.
+
+.. code-block:: bash
+
+    protein_accession	sequence_md5_digest	sequence_length	analysis	signature_accession	signature_description	start_location	stop_location	score	status	date	accession	description	go	pathways_annotations
+    SRR1620013.24594-C038EACXX:5:1101:20780:152561-1-merged-101-9_1_108_-	e9cde5b71a9a05b6f5140c51a445a8f4	36	Pfam	PF00742	Homoserine dehydrogenase	3	36	3.3E-10	T	28-04-2023	IPR001342	Homoserine dehydrogenase, catalytic	GO:0006520	MetaCyc: PWY-2941|MetaCyc: PWY-2942|MetaCyc: PWY-5097|MetaCyc: PWY-6160|MetaCyc: PWY-6559|MetaCyc: PWY-6562|MetaCyc: PWY-7153|MetaCyc: PWY-7977
+
+
+
+- ``*.merged.hmm.tsv.gz`` **file**
+
+Similarly to the ``*.merged_CDS.I5.tsv.gz`` file, this is the main output file of the HMMER annotation.
+When decompressed, this tab-separated file includes the HMM hits of the sample's filtered reads to KEGG ORTHOLOGY terms
+along with their scores.
+
+.. code-block:: bash
+
+    query_name	query_accession	tlen	target_name	target_accession	qlen	full_sequence_e-value	full_sequence_score	full_sequence_bias	#	of	c-evalue	i-evalue	domain_score	domain_bias	hmm_coord_from	hmm_coord_to	ali_coord_from	ali_coord_to	env_coord_from	env_coord_to	acc	description_of_target
+    SRR1620013.78392-C038EACXX:5:1103:10865:63862-1-merged-101-2_2_100_-	-	33	K00426	-	447	1.2e-09	36.2	0.1	1	1	1.6e-13	1.2e-09	36.1	0.1	136	168	1	33	1	33	0.97	-
+
+
+
+
+- The ``*.merged.summary.*`` **files**
+
+Based on the ``*.merged.hmm.tsv`` and the ``*.merged_CDS.I5.tsv`` files, a set of summary files is returned,
+each including resource-specific information.
+All of them are 3-column tab-separated files, including the annotation id, its description and the number of hits in the sample's reads.
+
+For example, the first lines of a ``*.merged.summary.pfam`` file would be:
+
+.. code-block:: bash
+    
+    "26","PF00005","ABC transporter"
+    "11","PF00012","Hsp70 protein"
+    "8","PF00133","tRNA synthetases class I (I, L, M and V)"
+    "7","PF00361","Proton-conducting membrane transporter"
+
+where the first column is the number of hits, the second the Pfam id and the third its description.
+
+The ``*.merged.summary.go_slim``, ``*.merged.summary.ips``, ``*.merged.summary.ko`` and ``*.merged.emapper.summary.eggnog``
+files follow the same layout.
+
+
+- **Files** under the ``stats`` subfolder in the ``functional-annotation`` folder
+
+A list of text files including statistics about the number of matches with each annotation resource. 
+For example:
+
+.. code-block:: bash 
+
+    user@server:~/my_analysis/results/functional-annotation/stats/$ cat ko.stats
+    Total KO matches	75
+    Predicted CDS with KO match	75
+    Reads with KO match	75
+
+
+
+
+
+
+
 Assembly step 
 -------------
 
+- ``final.contigs.fa`` **file**
+
+A .fasta file where each entry is a contig as returned from `MEGAHIT <https://doi.org/10.1093/bioinformatics/btv033>`_.
+
 
 
 
diff --git a/docs/usage.rst b/docs/usage.rst
index 38364411..a6574c36 100644
--- a/docs/usage.rst
+++ b/docs/usage.rst
@@ -217,7 +217,7 @@ Last, a subfolder called ``sequence-categorisation`` is also part of the ``resul
 including information about specific reads assigned in various categories.
 
 
-.. list-table::
+.. list-table:: sequence-categorisation
    :widths: 25 75
    :header-rows: 1
 
@@ -253,7 +253,5 @@ including information about specific reads assigned in various categories.
      - Predicted transfer RNA (`RF00005 <https://rfam.org/family/RF00005>`_)
    * - ``tRNA-Sec.RF01852.fasta.gz``
      - Predicted Selenocysteine transfer RNA (`RF01852 <https://rfam.org/family/RF01852>`_) 
-   * - ``taxonomy-summary``
-     - sd