
GAMA

Genomic Availability & Metadata Analysis Tool

⚠️ Development Status: Active / Early Stage
GAMA is currently in early development. Interfaces, methods, and outputs may change.
Users are encouraged to validate results independently and report issues.

GAMA is an R-based framework for surveying publicly accessible sequencing data across NCBI Assembly, SRA, and BioSample. Its aim is to support feasibility assessments for in silico research on underutilised plants.


Overview

GAMA:

  • Unifies NCBI database searches
  • Computes a data richness score
  • Classifies SRA accessions by experimental modality
  • Enables strategic parsing of Assembly and SRA results
  • Generates publication-ready visuals

See the GAMA user guide for a comprehensive overview of functions and methods.


Installation

Install the development version from GitHub using pak:

install.packages('pak')
pak::pak('JLewis-dev/GAMA')

Quick-Start Example

1. Load package

library(GAMA)

2. Configure NCBI access to improve rate limits (optional)

#rentrez::set_entrez_key('YOUR_API_KEY')

Uncomment and add your API key if you have one.
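As an alternative to pasting the key into a script, it can be read from an environment variable (for example one defined in `~/.Renviron`). The sketch below is illustrative only: `configure_entrez_key()` and the variable name `NCBI_API_KEY` are hypothetical and not part of GAMA; the only external call is `rentrez::set_entrez_key()`, which is skipped if rentrez is not installed.

```r
# Hypothetical helper (not part of GAMA): register an NCBI API key taken
# from an environment variable instead of hard-coding it in a script.
configure_entrez_key <- function(var = "NCBI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  # No key available: leave NCBI's default rate limits in place.
  if (!nzchar(key)) return(invisible(FALSE))
  # rentrez is required to register the key; do nothing if it is missing.
  if (!requireNamespace("rentrez", quietly = TRUE)) return(invisible(FALSE))
  rentrez::set_entrez_key(key)
  invisible(TRUE)
}

configure_entrez_key()
```

The function returns `TRUE` invisibly when a key was registered and `FALSE` otherwise, so a script can continue either way.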

3. Query NCBI databases using a list of species

RESULTS <- query_species(c('Vigna angularis', 'Vigna vexillata'))

4. Summarise data richness

SUMMARY <- summarise_availability(RESULTS)
print(SUMMARY)

5. Visualise data richness

plot_availability(SUMMARY)

6. Summarise SRA modality composition

META <- summarise_sra_availability(RESULTS)
print(META)

7. Visualise SRA modality composition

plot_sra_availability(META)

8. Extract filtered Assembly accession metadata

ASM <- extract_assembly_metadata(RESULTS, best = TRUE)
print(ASM)

9. Extract filtered SRA accession metadata

SRA <- extract_sra_metadata(RESULTS, species = 'Vigna vexillata', class = 'genomic')
print(SRA)

10. Cite

citation('GAMA')

Visualisation Tools

GAMA includes built-in plotting functions for rapid assessment.

Data richness plots

plot_availability() produces stacked bar plots showing:

  • Assembly contribution
  • SRA contribution
  • BioSample contribution
  • Overall data richness score

Supports custom ranking, colour palettes, and ggplot2 theming.

SRA modality plots

plot_sra_availability() visualises:

  • Relative abundance of sequencing strategies
  • Ontology-classified experiment types
  • Cross-species comparisons

Optional GEO overlays and experimental distribution summaries are available via additional functions.

Example outputs

Data richness plot

[Figure: GAMA example, data richness plot]

SRA modality plot

[Figure: GAMA example, SRA modality plot]

SRA skew plot

[Figure: GAMA example, SRA skew plot]


Transparency

Queries automatically record:

  • Tool version
  • Timestamp
  • Database sources

This metadata is embedded in outputs.
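How GAMA stores this metadata internally is not described here. As a rough illustration of the general idea, one common R pattern is to attach provenance to a result object as an attribute. Everything in the sketch below (`add_provenance()`, the attribute name, the placeholder version string and the empty counts) is an assumption made for illustration, not GAMA's actual mechanism.

```r
# Hypothetical illustration (not GAMA's implementation): attach provenance
# to a result object as an attribute so it travels with the data.
add_provenance <- function(x, tool_version, databases) {
  attr(x, "provenance") <- list(
    tool_version = tool_version,
    # UTC timestamp in ISO 8601 style
    timestamp = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
    databases = databases
  )
  x
}

# Placeholder table: counts are left as NA rather than invented.
counts <- data.frame(
  species = c("Vigna angularis", "Vigna vexillata"),
  n_sra = c(NA_integer_, NA_integer_)
)

tagged <- add_provenance(
  counts,
  tool_version = "0.0.0",  # placeholder version string
  databases = c("assembly", "sra", "biosample")
)

str(attr(tagged, "provenance"))
```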


Data Richness

The data richness score is defined as:

Score = A + S + B

Where A, S, and B are the transformed contributions of Assembly, SRA, and BioSample accession counts.

A = best + ln(1 + total − best), with assemblies weighted as:

  • Complete = 10
  • Chromosome = 8
  • Scaffold = 5
  • Contig = 2

Here, ln is the natural logarithm, best is the weight of the highest-weighted assembly (ties broken by highest N50), and total is the sum of the weights of all assembly accessions.

S = 2·ln(1 + SRA)
B = ln(1 + BioSample)

This formulation prioritises high-quality assemblies while incorporating diminishing returns for extensively sampled taxa.
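The formula can be worked through by hand in base R. The sketch below is not GAMA's own code: `richness_score()` is a hypothetical helper, the input counts are invented for illustration, and treating a species with no assemblies as A = 0 is an assumption.

```r
# Assembly-level weights as listed above.
assembly_weights <- c(Complete = 10, Chromosome = 8, Scaffold = 5, Contig = 2)

# Hypothetical helper (not part of GAMA's documented API).
# assembly_levels: character vector with one assembly level per accession.
richness_score <- function(assembly_levels, n_sra, n_biosample) {
  w <- unname(assembly_weights[assembly_levels])
  stopifnot(!anyNA(w))                       # reject unknown assembly levels
  best <- if (length(w) > 0) max(w) else 0   # assumption: no assemblies gives A = 0
  total <- sum(w)
  A <- best + log(1 + total - best)          # log() is the natural logarithm
  S <- 2 * log(1 + n_sra)
  B <- log(1 + n_biosample)
  c(A = A, S = S, B = B, Score = A + S + B)
}

# Invented example: one chromosome-level and two scaffold-level assemblies,
# 120 SRA accessions and 40 BioSample accessions.
example <- richness_score(
  c("Chromosome", "Scaffold", "Scaffold"),
  n_sra = 120,
  n_biosample = 40
)
print(round(example, 3))
```

In this example best = 8 and total = 18, so A = 8 + ln(11): the two scaffold-level assemblies add only about 2.4 points between them, which is the diminishing-returns behaviour described above.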


Ontology-Driven Classification

SRA experiments are classified using an ontology derived from large-scale metadata mining and manual curation.

Genomic

  • WGS
  • Amplicon-seq
  • RAD-seq
  • Targeted-Capture
  • Clone-based

Transcriptomic

  • RNA-seq
  • small-RNA
  • Long-read

Epigenomic

  • Bisulfite-seq
  • ChIP-seq
  • CUT&RUN
  • CUT&Tag
  • ATAC-seq
  • DNase-seq
  • FAIRE-seq
  • MNase-seq
  • SELEX

Chromatin

  • Hi-C
  • 3C-based
  • ChIA-PET
  • TCC

Other

  • Other

Fallback rules are applied when primary metadata fields are missing or ambiguous.
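The class hierarchy above can be expressed as a simple lookup table. The sketch below covers only the final step of mapping a modality label to its class, with unrecognised or missing labels falling back to Other. It is a hypothetical illustration: `classify_modality()` is not a GAMA function, and GAMA's actual classification works from SRA metadata fields with fallback rules that are not reproduced here.

```r
# Modality labels grouped by class, as listed above.
modality_classes <- list(
  Genomic = c("WGS", "Amplicon-seq", "RAD-seq", "Targeted-Capture", "Clone-based"),
  Transcriptomic = c("RNA-seq", "small-RNA", "Long-read"),
  Epigenomic = c("Bisulfite-seq", "ChIP-seq", "CUT&RUN", "CUT&Tag", "ATAC-seq",
                 "DNase-seq", "FAIRE-seq", "MNase-seq", "SELEX"),
  Chromatin = c("Hi-C", "3C-based", "ChIA-PET", "TCC")
)

# Invert the list into a named vector: modality label -> class.
modality_to_class <- setNames(
  rep(names(modality_classes), lengths(modality_classes)),
  unlist(modality_classes, use.names = FALSE)
)

# Hypothetical helper (not part of GAMA): labels that are missing or not in
# the table are assigned to "Other".
classify_modality <- function(modality) {
  cls <- unname(modality_to_class[modality])
  cls[is.na(cls)] <- "Other"
  cls
}

classify_modality(c("Hi-C", "RNA-seq", "WGS", NA, "unrecognised-label"))
```

Keeping the grouping as a list and deriving the lookup vector from it means a new modality only needs to be added in one place.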


Intended Use

GAMA is designed for:

  • Grant and project scoping
  • Identification of under-studied taxa
  • Strategic prioritisation of existing datasets

It is particularly suited to investigations of underutilised and non-model plant species.


Limitations

  • Dependent on NCBI metadata quality
  • Runtime increases with species list size
  • Novel protocols may not be fully captured by the ontology
  • Results should be interpreted cautiously during early development

Licence

See the LICENSE and LICENSE.md files for details.

