FracMinHash d_N/d_S: an alignment-free measure of selection pressures for metagenomic samples

Summary

Traditional d_N/d_S models infer selection pressures on protein-coding genes by estimating the ratio of nonsynonymous to synonymous substitution rates from sequence alignments. However, as sequencing becomes more accessible and data volumes grow, alignment-free methods are gaining traction. We therefore developed an alignment-free d_N/d_S estimator for pairwise analyses of longer sequences such as genomes.

The FracMinHash containment index has been linked to the simple mutation model, enabling alignment-free estimation of average nucleotide identity (ANI) and mutation rates between genomes. This framework has been extended to protein sequences via average amino acid identity (AAI), allowing joint estimation of d_N/d_S ratios for genomic selection pressures.
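To make the link between containment and identity concrete, here is a minimal sketch of the point estimate implied by the simple mutation model. It assumes each position mutates independently with probability p, so a k-mer survives unchanged with probability (1 - p)^k and the expected containment is approximately (1 - p)^k. The function names are illustrative and are not taken from this repository; how the DNA and protein estimates are combined into d_N/d_S is left to the repository's own scripts and is not reproduced here.

```python
def containment_to_identity(containment: float, ksize: int) -> float:
    """Point estimate of sequence identity (ANI or AAI) from a containment index.

    Assumes the simple mutation model, where containment ~ (1 - p)^k,
    so identity = 1 - p ~ containment^(1/k).
    """
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must lie in [0, 1]")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


def containment_to_mutation_rate(containment: float, ksize: int) -> float:
    """Point estimate of the per-position mutation rate p = 1 - identity."""
    return 1.0 - containment_to_identity(containment, ksize)
```

The same function would serve both alphabets: pass the nucleotide k-mer size with a DNA containment to approximate ANI, and the amino acid k-mer size with a protein containment to approximate AAI.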

In short, by sketching genomes with FracMinHash and comparing containments, evolutionary metrics such as d_N/d_S can be inferred without sequence alignments.

FracMinHash d_N/d_S Workflow

In the initial steps of estimating FracMinHash d_N/d_S, sourmash is used to sketch the genomic datasets and run pairwise comparisons, producing the FracMinHash containment indices that the estimate requires.

After generating these sketches, the pairwise containment indices for both DNA and protein sequences are calculated individually.
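The sketching and containment steps can be illustrated with a small, self-contained example. This is a simplified stand-in for what sourmash does, not its implementation: it hashes k-mers with SHA-256 rather than sourmash's own hash function, and it does not canonicalize DNA k-mers against their reverse complement. All function names here are hypothetical.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash space used in this sketch


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (SHA-256 stand-in, an assumption)."""
    digest = hashlib.sha256(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def fmh_sketch(seq: str, k: int, scaled: int) -> set:
    """Keep only hashes in the lowest 1/scaled fraction of the hash space."""
    cutoff = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < cutoff}


def containment(a: set, b: set) -> float:
    """Fraction of the hashes in sketch a that also occur in sketch b."""
    if not a:
        return 0.0
    return len(a & b) / len(a)
```

Containment is asymmetric, so each pair of sketches yields two values. With `scaled=1` every k-mer is retained, which makes toy inputs easy to check by hand; larger scaled values retain roughly one hash in every `scaled`.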

Finally, these containment indices are utilized within Python scripts to estimate FracMinHash-based d_N/d_S.

Executing FracMinHash d_N/d_S

The script takes the following parameters.

Parameter	Explanation
fasta_input_list	Input CSV file listing the FASTA files to sketch. The file follows the sourmash scripts example: the first column is the name, the second column is the DNA FASTA filename, and the third column is the protein FASTA filename.
scaled_input	Scaled factor for the signature sketches. Default: 500. Use a scaled factor of at least 10 for thousands of genomes.
ksize	k-mer length for the amino acid sequences (i.e., kaa). The program derives the nucleotide k-mer size as 3*kaa. Default: 7.
directory	Output directory for the estimates.
cores	Number of cores to use. Default: 100, which is ideal when processing thousands of genomes.
threshold	Containment threshold. Default: 0.05. Used in the sourmash branchwater plugin commands.
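As an illustration of the input format, the following sketch builds the three-column CSV described for fasta_input_list and derives the nucleotide k-mer size from the amino acid one. The header names `name`, `genome_filename`, and `protein_filename` are an assumption based on the column order described above, and the sample names and file paths are placeholders; check the sourmash branchwater plugin documentation for the exact header it expects.

```python
import csv
import io

# Assumed header, matching the described column order (name, DNA, protein).
HEADER = ["name", "genome_filename", "protein_filename"]


def format_input_csv(rows) -> str:
    """Return CSV text with one (name, dna_fasta, protein_fasta) row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


def nucleotide_ksize(kaa: int) -> int:
    """Nucleotide k-mer size implied by an amino acid k-mer size (3 * kaa)."""
    return 3 * kaa


# Placeholder sample names and paths, for illustration only.
example = format_input_csv([
    ("sample_a", "sample_a.fna", "sample_a.faa"),
    ("sample_b", "sample_b.fna", "sample_b.faa"),
])
```

Writing `example` to a file such as datasets.csv gives an input of the described shape. With the default ksize of 7, `nucleotide_ksize(7)` gives 21 for the DNA sketches.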

To execute, use the following command:

python3 fmh_dnds.py --fasta_input_list datasets.csv --directory .

KoslickiLab/dnds-using-fmh
