Skip to content

KoslickiLab/dnds-using-fmh

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

FracMinHash dN/dS: an alignment-free measure of selection pressures for metagenomic samples

Summary

Traditional dN/dS models infer selection pressues of protein-coding genes by estimating the rates of nonsynonymous substitutions to synonymous ones using sequence alignments. However, as sequencing becomes more accessible and data volumes grow, alignment-free methods are gaining traction. Therefore, we developed an alignment-free dN/dS estimator to apply on pairwise analyses of longer sequences such as genomes.

The FracMinHash containment index has been linked to the simple mutation model, enabling alignment-free estimation of average nucleotide identity (ANI) and mutation rates between genomes. This framework has been extended to protein sequences via average amino acid identity (AAI), allowing joint estimation of dN/dS ratios for genomic selection pressures.

In short, by sketching genomes with FracMinHash and comparing containments, evolutionary metrics such as dN/dS can be inferred without sequence alignments.

FracMinHash dN/dS Workflow

In the intitial steps of estimating FracMinHash dN/dS, sourmash is used to sketch genomic datasets and conduct pairwise comparisons, producing FracMinHash containment indices required for estimating FracMinHash dN/dS.

After generating these sketches, the pairwise containment indices for both DNA and protein sequences are calculated individually.

Finally, these containment indices are utilized within Python scripts to estimate FracMinHash-based dN/dS.

Description

Executing FracMinHash dN/dS

The script utilizes the following required parameters.

Parameter Explanation
fasta_input_list Input csv file that contains fasta files for sketching. This csv file follows sourmash scripts example, where in the first column is name, second column is dna fasta filename, and third column is protein fasta filename.
scaled_input Identify a scaled factor for signature sketches. Default: 500. Use a scale factor of at least 10 for thousands of genomes.
ksize Length of k-mer. This is the k-mer size for the amino acid (i.e., kaa). The program obtains the k-size of the nucleotide sequencing by calculating 3*kaa. Default is 7.
directory Output directory for estimations.
cores Total core usage (default: 100, ideal when using thousands of genomes)
threshold Set containment threshold (default: 0.05, used in sourmash plugin branchwater commands)

To execute, use the following commad:

python3 fmh_dnds.py --fasta_input_list datasets.csv --directory .

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •