This repository contains the scripts used for the analyses presented in ARMS-MBON's second data paper. For the related datasets, see:
- data_release_002: Occurrence and event data.
- analysis_release_002: Bioinformatics pipeline outputs.
Description:
This script performs blank curation, the first step of data processing. It uses the prevalence method from the decontam R package to identify and remove potential contaminants from the dataset.
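
The idea behind a prevalence-based blank check can be sketched in Python as follows. This is not the decontam implementation (decontam is an R package with its own scoring); it is a simplified stand-in that applies a one-sided Fisher exact test to presence/absence counts in blanks versus true samples. The function names, the example ASV identifiers, and the 0.1 threshold are assumptions made for illustration.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact test on a 2x2 presence/absence table.

    a: blanks where the feature is present   b: blanks where it is absent
    c: samples where the feature is present  d: samples where it is absent
    Returns P(X >= a) with all margins fixed, i.e. the probability of seeing
    the feature in at least this many blanks by chance.
    """
    n_blank = a + b
    n_present = a + c
    total = a + b + c + d
    upper = min(n_blank, n_present)
    numerator = sum(
        comb(n_present, k) * comb(total - n_present, n_blank - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(total, n_blank)


def flag_contaminants(counts, is_blank, threshold=0.1):
    """Return feature IDs that are significantly more prevalent in blanks.

    counts: dict mapping feature ID -> list of read counts, one per sample
    is_blank: list of booleans in the same sample order (True = blank)
    threshold: assumed cut-off for this sketch, not a recommended value
    """
    flagged = []
    for feature_id, row in counts.items():
        a = sum(1 for x, blank in zip(row, is_blank) if blank and x > 0)
        b = sum(1 for x, blank in zip(row, is_blank) if blank and x == 0)
        c = sum(1 for x, blank in zip(row, is_blank) if not blank and x > 0)
        d = sum(1 for x, blank in zip(row, is_blank) if not blank and x == 0)
        if fisher_one_sided(a, b, c, d) < threshold:
            flagged.append(feature_id)
    return flagged


# Hypothetical data: three blanks followed by five true samples.
is_blank = [True, True, True, False, False, False, False, False]
counts = {
    "asv_contam": [12, 8, 5, 0, 0, 0, 0, 0],   # seen only in blanks
    "asv_real": [0, 0, 0, 40, 55, 31, 60, 22],  # seen only in samples
}
contaminants = flag_contaminants(counts, is_blank)
```

In this toy example only the feature found exclusively in blanks is flagged; features flagged this way would then be dropped from the read count table.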
Description:
The second step involves renaming sample identifiers. PEMA outputs use ENA accession numbers as sample names, which are replaced with their corresponding material sample IDs for clarity and consistency.
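
A minimal Python sketch of this renaming step is shown below. The function name, the accession strings, and the material sample IDs are hypothetical placeholders; the real mapping comes from the project's own metadata.

```python
def rename_samples(column_names, accession_to_sample_id):
    """Replace accession-style column names with material sample IDs.

    Names without a mapping are kept unchanged and also returned in a
    second list, so non-sample columns or unmapped samples are easy to spot.
    """
    renamed = []
    unmapped = []
    for name in column_names:
        if name in accession_to_sample_id:
            renamed.append(accession_to_sample_id[name])
        else:
            renamed.append(name)
            unmapped.append(name)
    return renamed, unmapped


# Hypothetical header of a read count table and a hypothetical mapping.
header = ["ASV_ID", "ERR0000001", "ERR0000002"]
mapping = {"ERR0000001": "SAMPLE_A", "ERR0000002": "SAMPLE_B"}
new_header, not_mapped = rename_samples(header, mapping)
```

Returning the unmapped names instead of failing silently makes it straightforward to check that every sample column was actually renamed.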
Description:
This step merges data from the PEMA outputs, including read count tables, taxonomy assignments, and FASTA files, for each genetic marker. Separate scripts handle each marker:
- One script processes data for the COI gene.
- One script processes data for the 18S gene.
- One script processes data for the ITS gene.
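
The merge for a single marker can be sketched in Python as a join on a shared sequence identifier. This is an illustration under stated assumptions, not the repository's code: the function names, the dictionary layout, and the idea that all three inputs share the same identifier are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of sequence ID -> sequence string.

    The ID is taken as the first whitespace-delimited token of each header.
    """
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line
    return sequences


def merge_marker_tables(read_counts, taxonomy, sequences):
    """Combine counts, taxonomy, and sequence per ID for one marker.

    read_counts: dict of ID -> dict of sample ID -> read count
    taxonomy: dict of ID -> taxonomy string
    sequences: dict of ID -> sequence string
    IDs missing from taxonomy or sequences get None for that field.
    """
    merged = {}
    for seq_id, counts in read_counts.items():
        merged[seq_id] = {
            "counts": counts,
            "taxonomy": taxonomy.get(seq_id),
            "sequence": sequences.get(seq_id),
        }
    return merged


# Hypothetical inputs for one marker.
fasta_text = ">asv1 some description\nACGT\nAC\n>asv2\nGG\n"
merged = merge_marker_tables(
    read_counts={"asv1": {"SAMPLE_A": 10}, "asv2": {"SAMPLE_A": 3}},
    taxonomy={"asv1": "Animalia;Arthropoda"},
    sequences=parse_fasta(fasta_text),
)
```

Keeping one merge routine and calling it once per marker is one possible design; the repository instead uses a separate script for each marker.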
Description:
The final step involves exploratory data analysis and visualization, including:
- Curation of merged datasets.
- Assessment of sequencing depth.
- Visualization of recovered phyla and species.
- Creation of an UpSet plot to show the overlap in species identified across marker gene datasets.
- Comparisons between datasets from the first data paper (DP001) and the second data paper (DP002).
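
The species overlap summarized by an UpSet plot can be computed with plain set operations. The sketch below counts exclusive intersections, meaning species found in exactly a given combination of markers and no others, which are the counts an UpSet plot typically displays. The function name and the species labels are hypothetical, and the plotting itself is left out.

```python
from itertools import combinations


def exclusive_intersections(species_by_marker):
    """Count species found in exactly each combination of markers.

    species_by_marker: dict of marker name -> iterable of species names
    Returns a dict of marker-name tuple -> number of species present in all
    markers of that tuple and absent from every other marker. Empty
    combinations are omitted.
    """
    markers = sorted(species_by_marker)
    result = {}
    for size in range(1, len(markers) + 1):
        for combo in combinations(markers, size):
            inside = set.intersection(
                *(set(species_by_marker[m]) for m in combo)
            )
            others = [set(species_by_marker[m]) for m in markers if m not in combo]
            outside = set().union(*others)
            exclusive = inside - outside
            if exclusive:
                result[combo] = len(exclusive)
    return result


# Hypothetical species lists per marker gene.
overlap = exclusive_intersections(
    {
        "COI": {"species_a", "species_b", "species_c"},
        "18S": {"species_b", "species_c", "species_d"},
        "ITS": {"species_c", "species_e"},
    }
)
```

Each key in the result corresponds to one column of the UpSet matrix, and each value to the height of the bar above it.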
This code and documentation are provided to support the transparency and reproducibility of all analyses conducted in the ARMS-MBON project.