UNIQmin: An alignment-independent tool for the study of pathogen sequence diversity at any given rank of taxonomy lineage
Sequence variation among pathogens, even of a single amino acid, can expand their host repertoire or enhance their infection ability. An alignment-independent approach offers an alternative way to study pathogen diversity, as it does not require sequence conservation to perform comparative analyses. Herein, we present UNIQmin, a tool that utilises an alignment-independent method to generate the minimal set of pathogen sequences, as a way to study their diversity across any rank of taxonomic lineage. The minimal set refers to the smallest possible number of sequences required to capture the entire repertoire of pathogen peptidome diversity present in a sequence dataset.
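To make the idea of a minimal set more concrete, here is a small Python sketch. It is not the UNIQmin implementation: it assumes the peptidome is represented by overlapping k-mers and uses a simple greedy set-cover heuristic, which approximates a small covering set but does not guarantee the smallest one. The function names `kmers` and `greedy_minimal_set` are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Set


def kmers(sequence: str, k: int = 9) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence (empty if too short)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def greedy_minimal_set(sequences: Dict[str, str], k: int = 9) -> List[str]:
    """Greedily select sequence IDs until every distinct k-mer is covered.

    Illustrative assumption: a greedy set-cover heuristic, not UNIQmin's
    actual selection procedure.
    """
    kmer_sets = {seq_id: kmers(seq, k) for seq_id, seq in sequences.items()}
    uncovered: Set[str] = set()
    for kmer_set in kmer_sets.values():
        uncovered |= kmer_set

    selected: List[str] = []
    while uncovered:
        # Pick the sequence that covers the most still-uncovered k-mers;
        # ties go to the sequence that appears first in the input.
        best = max(kmer_sets, key=lambda s: len(kmer_sets[s] & uncovered))
        selected.append(best)
        uncovered -= kmer_sets[best]
    return selected


# Toy data with k = 3 to keep the example readable.
toy = {"s1": "ABCDEF", "s2": "ABCD", "s3": "CDEFGH"}
minimal = greedy_minimal_set(toy, k=3)
```

In this toy dataset, `s2` contributes no k-mer that is not already present in `s1`, so it is left out of the selected set while every distinct 3-mer remains represented.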
- Step-by-step of UNIQmin
- Figure Scheme
- UNIQmin as a Pipeline
- Generate a random protein sequence dataset
- Citing Resources
- Found a Bug
Please refer to the PythonScript folder.
As visualised above, UNIQmin comprises five steps, each with its respective Python script, employed in the order of the steps (server specs: Intel(R) Xeon(R) E5-2690 v2 @ 3.00GHz 40-core processors, 396 GB of RAM and 44 TB of local storage). The single pipeline shell script (UNIQmin.sh), sample input file (exampleinput.fas) and example output (exampleoutput.fasta) are provided.
uniqmin.sh
python uniqmin.py -i exampleinput.fas -o example -k 9 -cpu 14
- via pip
pip install uniqmin
- via package clone from GitHub repository
git clone https://github.com/ChongLC/MinimalSetofViralPeptidome-UNIQmin.git
Note for users of a conda environment (e.g. Jupyter Notebook):
Before installing the package with pip, run
conda config --add channels conda-forge
conda install pyahocorasick
... and restart the kernel to use the updated package. Then, run
pip install uniqmin
pip install uniqmin --upgrade
uniqmin [-h] [-i INPUT] [-o OUTPUT] [-k KMERLENGTH] [-cpu CPUSIZE]
For example, the UNIQmin tool is applied to generate a minimal set (in the example folder) from a sample input file (exampleinput.fas). A k-mer window size of nine (9; nonamer) is used, utilising 14 cores.
uniqmin -i exampleinput.fas -o example -k 9 -cpu 14
Argument | Parameter | Type | Required | Default | Description |
---|---|---|---|---|---|
-h | help | N/A | FALSE | N/A | Show this help message and exit |
-i | sequence input file | String | TRUE | N/A | Path of the input file (in FASTA format) |
-o | output directory name | String | TRUE | N/A | Path of the output file to be created |
-k | k-mer window size | Integer | FALSE | 9 | The length of k-mers to be used |
-cpu | cpu size | Integer | FALSE | 14 | The number of CPU cores to be used |
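For users who want to call the tool from a Python script, the sketch below assembles the command line from the arguments documented in the table above. The helper names `build_uniqmin_command` and `run_uniqmin` are hypothetical and not part of the `uniqmin` package; the sketch assumes the `uniqmin` executable is installed and on the `PATH`.

```python
import subprocess
from typing import List


def build_uniqmin_command(
    input_fasta: str,
    output_dir: str,
    kmer_length: int = 9,  # default k-mer window size from the table
    cpu: int = 14,         # default number of CPU cores from the table
) -> List[str]:
    """Build the argument list for the documented uniqmin command line."""
    return [
        "uniqmin",
        "-i", input_fasta,
        "-o", output_dir,
        "-k", str(kmer_length),
        "-cpu", str(cpu),
    ]


def run_uniqmin(input_fasta: str, output_dir: str, **options: int) -> None:
    """Run uniqmin as a subprocess; raises CalledProcessError on failure."""
    command = build_uniqmin_command(input_fasta, output_dir, **options)
    subprocess.run(command, check=True)
```

Calling `run_uniqmin("exampleinput.fas", "example")` would then be equivalent to the example command shown earlier, with the default k-mer length and core count.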
This section is specific to the protocol paper. For details of this section and the Python script, please refer to the randomizer folder.
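As a rough illustration of what generating a random protein sequence dataset can look like, the sketch below writes random sequences in FASTA format. It is not the script from the randomizer folder: the uniform sampling over the 20 standard amino acids, the length range, the header naming and the function names are all assumptions made for this example.

```python
import random
from typing import List, Optional, Tuple

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein_records(
    n: int,
    min_len: int = 50,
    max_len: int = 200,
    seed: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Generate n (header, sequence) pairs with uniformly sampled residues."""
    rng = random.Random(seed)  # a seed makes the dataset reproducible
    records = []
    for i in range(1, n + 1):
        length = rng.randint(min_len, max_len)
        sequence = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        records.append((f"random_seq_{i}", sequence))
    return records


def to_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Format records as FASTA text, wrapping sequences at `width` characters."""
    lines = []
    for header, sequence in records:
        lines.append(f">{header}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


records = random_protein_records(5, min_len=20, max_len=30, seed=1)
fasta_text = to_fasta(records)
```

The resulting `fasta_text` can be written to a `.fas` file and passed to the tool through the `-i` argument.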
- For the original paper, please refer to our MDPI Biology paper:
Chong, L.C.; Lim, W.L.; Ban, K.H.K.; Khan, A.M. An Alignment-Independent Approach for the Study of Viral Sequence Diversity at Any Given Rank of Taxonomy Lineage. Biology 2021, 10, 853. doi: 10.3390/biology10090853
- For the protocol paper, please refer to our preprint:
Chong, L.C.; Khan, A.M. UNIQmin, An Alignment-free Tool to Study Viral Sequence Diversity across Taxonomic Lineages: A Case Study of Monkeypox Virus. bioRxiv 2022.08.09.503271. doi: 10.1101/2022.08.09.503271
Found a bug, would like a feature added, or want to share some feedback? Just open a new issue or send us an email (lichuinchong@gmail.com).