This GitHub repository accompanies our study evaluating tools for functional profiling based on 16S rRNA gene sequencing. It contains the scripts and resources used in our assessment of PICRUSt2, Tax4Fun2, PanFP, and the MetGEMs toolbox. The primary objective of the project is to highlight the limitations of these tools when inferring functional profiles from 16S rRNA gene sequencing data.
The "Datasets" folder contains the output of the four functional inference tools (PICRUSt2, Tax4Fun2, PanFP, and the MetGEMs toolbox) for both real and simulated datasets, together with the corresponding mapping files.
We evaluated the performance of these functional inference tools using simulated metagenomic samples obtained from the 2nd Critical Assessment of Metagenome Interpretation (CAMI) Challenge. The "simulation" folder under "datasets" contains both the downloaded shotgun metagenome datasets and the derived full-length 16S rRNA sequences for each body site. The data can be downloaded from https://frl.publisso.de/data/frl:6425518/.
Individual folders are designated for each population cohort, with each file named according to the respective tool used. For instance, "CRC_PICRUST2_KO.tsv" denotes KO terms retrieved from PICRUSt2, while files with the suffix "REL_KO" indicate relative abundance. Files with "CUSTOM_KO_REL" signify the use of a customized normalization method. This naming convention applies to other cohorts such as POPGEN, FOCUS, and KORA.
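As an illustration of what a "REL_KO" file represents, the sketch below converts a KO count table into per-sample relative abundances by dividing each value by its sample's column total. It is a minimal example under stated assumptions, not the repository's own script: the table layout (a first column named "KO" followed by one column per sample), the sample names, and the function names are hypothetical, and the customized normalization behind the "CUSTOM_KO_REL" files is not covered here.

```python
import csv
import io


def read_ko_table(handle):
    """Read a tab-separated KO table into a list of row dictionaries.

    Assumed layout (hypothetical): a 'KO' column, then one column per sample.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


def relative_abundance(rows):
    """Scale every sample column so that its values sum to 1."""
    samples = [name for name in rows[0] if name != "KO"]
    # Column totals, one per sample.
    totals = {s: sum(float(row[s]) for row in rows) for s in samples}
    scaled = []
    for row in rows:
        new_row = {"KO": row["KO"]}
        for s in samples:
            # Samples with a zero total are left at 0.0 to avoid dividing by zero.
            new_row[s] = float(row[s]) / totals[s] if totals[s] else 0.0
        scaled.append(new_row)
    return scaled


# Small in-memory table standing in for a file such as "CRC_PICRUST2_KO.tsv".
demo = "KO\tsample_A\tsample_B\nK00001\t30\t0\nK00002\t10\t5\n"
rel = relative_abundance(read_ko_table(io.StringIO(demo)))
```

To apply the same function to a file on disk, pass an open file handle to `read_ko_table` instead of the in-memory string.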
The "processing_script" folder contains the scripts used to derive KO abundance from each of the four tools. It also contains the script used to filter 16S rRNA gene sequences from the simulated metagenome sequences using the FilterRead pipeline.
This folder contains R scripts for all downstream analyses, including differential abundance testing for each cohort and for the simulated datasets.