Traditional dN/dS models infer selection pressues of protein-coding genes by estimating the rates of nonsynonymous substitutions to synonymous ones using sequence alignments. However, as sequencing becomes more accessible and data volumes grow, alignment-free methods are gaining traction. Therefore, we developed an alignment-free dN/dS estimator to apply on pairwise analyses of longer sequences such as genomes.
The FracMinHash containment index has been linked to the simple mutation model, enabling alignment-free estimation of average nucleotide identity (ANI) and mutation rates between genomes. This framework has been extended to protein sequences via average amino acid identity (AAI), allowing joint estimation of dN/dS ratios for genomic selection pressures.
In short, by sketching genomes with FracMinHash and comparing containments, evolutionary metrics such as dN/dS can be inferred without sequence alignments.
In the intitial steps of estimating FracMinHash dN/dS, sourmash is used to sketch genomic datasets and conduct pairwise comparisons, producing FracMinHash containment indices required for estimating FracMinHash dN/dS.
After generating these sketches, the pairwise containment indices for both DNA and protein sequences are calculated individually.
Finally, these containment indices are utilized within Python scripts to estimate FracMinHash-based dN/dS.
The script utilizes the following required parameters.
Parameter | Explanation |
---|---|
fasta_input_list | Input csv file that contains fasta files for sketching. This csv file follows sourmash scripts example, where in the first column is name, second column is dna fasta filename, and third column is protein fasta filename. |
scaled_input | Identify a scaled factor for signature sketches. Default: 500. Use a scale factor of at least 10 for thousands of genomes. |
ksize | Length of k-mer. This is the k-mer size for the amino acid (i.e., kaa). The program obtains the k-size of the nucleotide sequencing by calculating 3*kaa. Default is 7. |
directory | Output directory for estimations. |
cores | Total core usage (default: 100, ideal when using thousands of genomes) |
threshold | Set containment threshold (default: 0.05, used in sourmash plugin branchwater commands) |
To execute, use the following commad:
python3 fmh_dnds.py --fasta_input_list datasets.csv --directory .