Home
pip install phyloki
Input
! Phyloki -h
Output
usage: Phyloki [--get_sequences] [--fetch_metadata] [--get_organisms]
[--update_tree] [--get_hosts] [--get_hosts_order] [-V]
Phyloki: metadata fetcher for microbial phylogenetics.
options:
--get_sequences Run the module that retrieves nucleotide sequences from a
list of accession numbers.
--fetch_metadata Run the module to fetch metadata associated with
accession numbers.
--get_organisms Run the module that fetches organism names by their
corresponding accession versions.
--update_tree Run the module that updates a phylogenetic tree by
replacing accession numbers with accession numbers +
organism names.
--get_hosts Run the module that retrieves host information for given
accession numbers.
--get_hosts_order Run the module that fetches the taxonomic order for host
organisms.
-V, --version Display the version of Phyloki and exit.
Phyloki
consists of 6 modules. Let's break them down!
To use Phyloki,
the user must have an accession_numbers.txt
file. It must look like this:
Input
! head -5 accession_numbers.txt
Output
NC_034519
NC_055636
NC_005225
NC_038939
NC_038529
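Before sending any requests to NCBI, it can help to sanity-check the input file. The following is a minimal sketch, assuming one accession number per line as shown above; the helper name `read_accessions` is hypothetical and is not part of Phyloki.

```python
from pathlib import Path


def read_accessions(path):
    """Return accession numbers from a text file, one per line.

    Blank lines and surrounding whitespace are skipped; duplicates are
    dropped while the original order is preserved.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    cleaned = (line.strip() for line in lines)
    # dict.fromkeys keeps first-seen order and removes duplicates
    return list(dict.fromkeys(acc for acc in cleaned if acc))
```

Calling `read_accessions("accession_numbers.txt")` on the file above should give a plain list of accession strings that can be counted or inspected before running any module.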
Input
! Phyloki --get_sequences -h
Output
usage: get_seqs.py [-h] -email EMAIL -i INPUT -o OUTPUT
Fetch nucleotide sequences from NCBI and save them as FASTA files.
options:
-h, --help show this help message and exit
-email, --email EMAIL
User email (required for NCBI API requests)
-i, --input INPUT Path to TXT file containing accession numbers (one per
line)
-o, --output OUTPUT Directory to save downloaded FASTA files
Input
! Phyloki --get_sequences -email ivpopov@donstu.ru -i accession_numbers.txt -o sample_seq_directory
Output
Downloaded: NC_034519
Downloaded: NC_055636
Downloaded: NC_005225
Downloaded: NC_038939
Downloaded: NC_038529
Downloaded: NC_034467
Downloaded: NC_034553
Downloaded: NC_003468
Downloaded: NC_034515
Downloaded: NC_038299
Downloaded: NC_077671
Downloaded: NC_034403
Downloaded: NC_038695
Downloaded: NC_034556
Downloaded: LC553715
Downloaded: NC_005238
Downloaded: NC_005235
Downloaded: NC_034517
Downloaded: NC_006435
Downloaded: NC_005222
Downloaded: NC_055147
Downloaded: NC_034560
Downloaded: NC_034485
Downloaded: NC_034407
Downloaded: NC_034399
Downloaded: NC_034402
Downloaded: MG663536
Downloaded: NC_038515
Downloaded: NC_078262
Downloaded: NC_055632
Downloaded: NC_034401
Downloaded: OR684449
Downloaded: FJ593498
Downloaded: KX512433
Downloaded: NC_078485
Downloaded: KX779126
Downloaded: NC_034564
Downloaded: NC_010707
Downloaded: NC_055170
All downloads completed.
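For readers curious about what a request for a single record can look like, here is a rough sketch that only builds an NCBI E-utilities `efetch` URL. It is not Phyloki's implementation, which this guide does not show; the function name `build_efetch_url` and the example email address are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for fetching records
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, email):
    """Build an efetch URL that returns one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # NCBI Nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "email": email,       # lets NCBI identify and contact the client
    }
    return f"{EFETCH_URL}?{urlencode(params)}"
```

The sketch deliberately stops short of sending the request, so it can be run offline; the resulting URL can be opened in a browser to see the FASTA record.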
For demo purposes, I use the data from:
Barbosa dos Santos, M., Koide Albuquerque, N., Patroca da Silva, S. et al. A novel hantavirus identified in bats (Carollia perspicillata) in Brazil. Sci Rep 14, 6346 (2024). https://doi.org/10.1038/s41598-024-56808-6
The tree is precomputed and stored in the demo_data/
directory.
Upload demo_data/tree_ufb.treefile
to iTOL for visualization.
Figure 1. Reference tree from the original paper
Figure 2. Naked phylogenetic tree
This tree is naked.
It lacks:
- Organism name annotations. There are only accession numbers, which say nothing on their own.
- Host information. The tree shows phylogenetic relationships between different viruses, but there is no information about the host organisms of those viruses.
It is worth mentioning that the two trees are literally identical. (Bootstrap values are even better in my version.)
At this point the user reaches a fork: either fetch all metadata from NCBI GenBank and use it for annotation in ggtree,
or create a dataset for use in iTOL.
First, the guide for fetching metadata is provided.
Input
! Phyloki --fetch_metadata -h
Output
usage: fetch_metadata.py [-h] -email EMAIL -i INPUT -o OUTPUT
Fetch metadata for nucleotide sequences from NCBI.
options:
-h, --help show this help message and exit
-email, --email EMAIL
User email (required for NCBI API requests)
-i, --input INPUT Path to TXT file containing accession numbers (one per
line)
-o, --output OUTPUT Path to output file (.tsv) to save retrieved metadata
Input
! Phyloki --fetch_metadata -email ivpopov@donstu.ru -i accession_numbers.txt -o sample_metadata.tsv
Output
Metadata retrieval complete.
File saved to sample_metadata.tsv
Input
import pandas as pd
df = pd.read_csv('sample_metadata.tsv', sep='\t')
df.head(5)
Output
   AN           AN_OrganismName                                  Country         Year  Host
0  NC_034519.1  NC_034519.1 Orthohantavirus khabarovskense       China           2011  Microtus maximowiczii
1  NC_055636.1  NC_055636.1 Orthohantavirus tatenalense          United Kingdom  2014  Microtus agrestis
2  NC_005225.1  NC_005225.1 Orthohantavirus puumalaense          ND              ND    ND
3  NC_038939.1  NC_038939.1 Orthohantavirus prospectense         USA             ND    Microtus pennsylvanicus
4  NC_038529.1  NC_038529.1 Eothenomys miletus hantavirus LX309  China           2009  Eothenomys miletus
This is a tab-separated values (TSV) file with the following columns:
- Accession number
- Accession number and organism name
- Country (/geo_loc_name) where the organism was isolated
- Year when the organism was isolated
- Host from which the organism was isolated
The user can use this metadata to annotate the tree in ggtree.
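Since missing values are written as ND, it may be useful to see how complete the metadata is before annotating a tree. The sketch below uses only the standard library and assumes the column names shown in the table above; `count_missing` and the two-row `sample` string are illustrative, not part of Phyloki.

```python
import csv
import io
from collections import Counter


def count_missing(tsv_text, missing="ND"):
    """Count, per column, how many rows hold the missing-value marker."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        for column, value in row.items():
            if value == missing:
                counts[column] += 1
    return dict(counts)


# Two rows copied from the table above, joined with tabs
sample = (
    "AN\tAN_OrganismName\tCountry\tYear\tHost\n"
    "NC_005225.1\tNC_005225.1 Orthohantavirus puumalaense\tND\tND\tND\n"
    "NC_038939.1\tNC_038939.1 Orthohantavirus prospectense\tUSA\tND\tMicrotus pennsylvanicus\n"
)
print(count_missing(sample))
```

For a real file, the text can be read with `open("sample_metadata.tsv", encoding="utf-8").read()` and passed to the same function.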
Input
! Phyloki --get_organisms -h
Output
usage: get_organisms.py [-h] -email EMAIL -i INPUT -o OUTPUT
Fetch organism names and accession versions from NCBI.
options:
-h, --help show this help message and exit
-email, --email EMAIL
User email (required for NCBI API requests)
-i, --input INPUT Path to TXT file containing accession numbers (one per
line)
-o, --output OUTPUT File path to save organism names and accession
versions
Input
! Phyloki --get_organisms -email ivpopov@donstu.ru -i sample_accession_numbers.txt -o accession_organism.txt
Output
Metadata retrieval complete.
File saved to accession_organism.txt
Input
! head -5 accession_organism.txt
Output
NC_034519.1 Orthohantavirus khabarovskense
NC_055636.1 Orthohantavirus tatenalense
NC_005225.1 Orthohantavirus puumalaense
NC_038939.1 Orthohantavirus prospectense
NC_038529.1 Eothenomys miletus hantavirus LX309
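The annotation file can also be loaded into a dictionary for checks in Python. This is a small sketch under the assumption that each line holds an accession version followed by whitespace (a tab or a space) and then the organism name; `parse_annotation` is a hypothetical helper, not a Phyloki function.

```python
def parse_annotation(lines):
    """Map accession versions to organism names.

    Each line is split on the first run of whitespace only, so organism
    names that contain spaces stay intact. Lines without a name are skipped.
    """
    mapping = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        if len(parts) == 2:
            accession, organism = parts
            mapping[accession] = organism
    return mapping
```

For example, `parse_annotation(open("accession_organism.txt", encoding="utf-8"))` would return one entry per annotated accession.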
Now let's update the tree!
Input
! Phyloki --update_tree -h
Output
usage: update_tree.py [-h] -annotation ANNOTATION -tree TREE
-upd_tree UPD_TREE
Update a tree file with annotated organism names based on accession numbers.
options:
-h, --help show this help message and exit
-annotation ANNOTATION
Path to the text file containing accession numbers and
organism names.
-tree TREE Path to the tree file that needs to be updated.
-upd_tree UPD_TREE Path to save the updated tree file.
Input
! Phyloki --update_tree -annotation accession_organism.txt -tree sample_tree.treefile -upd_tree annotated_tree.treefile
Output
The request has been fulfilled.
File saved to annotated_tree.treefile
Input
! head tree_ufb.treefile
Output
(FJ593498.1:0.1240225441,KX512433.1:0.1580233515,((((((KX779126.1:0.1801341369,NC_034564.1:0.1518834757)100:0.2690724126,NC_010707.1:0.4026159852)100:0.5357342048,NC_055170.1:3.2821188681)96:0.1993926731,(((((((LC553715.1:0.2424396410,NC_034556.1:0.2425091493)100:0.1540926638,NC_005238.1:0.2987153355)100:0.0815585291,((NC_005222.1:0.1745023294,NC_006435.1:0.1329576555)100:0.2619343862,(NC_005235.1:0.3091225291,NC_034517.1:0.3426538757)100:0.0890215926)59:0.0409665566)100:0.2464858691,NC_055147.1:0.5009202874)69:0.0579848758,((NC_034399.1:0.4574934455,NC_034407.1:0.4201827666)100:0.2133110745,(NC_034485.1:0.3554924125,NC_034560.1:0.3958031671)100:0.1134597575)100:0.0983251229)88:0.0539862361,NC_034402.1:0.6194957047)100:0.2179091508,(((((NC_003468.2:0.3220309131,NC_034553.1:0.3217768427)100:0.0967750566,(NC_034515.1:0.3420020277,NC_038299.1:0.3578938480)78:0.0604905717)100:0.0681808060,(NC_034403.1:0.4057149461,NC_077671.1:0.3295415521)96:0.0808310506)100:0.1515347499,NC_038695.1:0.6146030315)75:0.0529123693,(((NC_005225.1:0.3178994625,(NC_034519.1:0.2903408237,NC_055636.1:0.2951103060)96:0.0707049689)100:0.1162566928,NC_038939.1:0.4860686808)100:0.0974816090,(NC_034467.1:0.3408088379,NC_038529.1:0.3214413064)100:0.1876647016)100:0.0906674433)100:0.3112288106)100:0.2995543026)97:0.1136359007,NC_078485.1:1.2137610889)49:0.0697049196,(((MG663536.1:0.4927348232,NC_038515.1:0.3837609395)94:0.0895431598,NC_078262.1:0.4767046102)100:0.2182159381,((NC_034401.1:0.5482148765,NC_055632.1:0.5333969980)100:0.2727779310,OR684449.1:0.6549294470)90:0.1135643862)55:0.0661132415)100:0.9075896851);
A typical treefile contains only accession numbers, which say nothing on their own.
Input
! head annotated_tree.treefile
Output
(FJ593498.1 Nova virus:0.1240225441,KX512433.1 Nova virus:0.1580233515,((((((KX779126.1 Imjin virus:0.1801341369,NC_034564.1 Imjin virus:0.1518834757)100:0.2690724126,NC_010707.1 Thottapalayam virus:0.4026159852)100:0.5357342048,NC_055170.1 Hainan oriental leaf-toed gecko hantavirus:3.2821188681)96:0.1993926731,(((((((LC553715.1 Orthohantavirus thailandense:0.2424396410,NC_034556.1 Anjozorobe virus:0.2425091493)100:0.1540926638,NC_005238.1 Orthohantavirus seoulense:0.2987153355)100:0.0815585291,((NC_005222.1 Orthohantavirus hantanense:0.1745023294,NC_006435.1 Hantavirus Z10:0.1329576555)100:0.2619343862,(NC_005235.1 Orthohantavirus dobravaense:0.3091225291,NC_034517.1 Orthohantavirus sangassouense:0.3426538757)100:0.0890215926)59:0.0409665566)100:0.2464858691,NC_055147.1 Tigray virus:0.5009202874)69:0.0579848758,((NC_034399.1 Jeju virus:0.4574934455,NC_034407.1 Bowe virus:0.4201827666)100:0.2133110745,(NC_034485.1 Orthohantavirus caobangense:0.3554924125,NC_034560.1 Kenkeme virus:0.3958031671)100:0.1134597575)100:0.0983251229)88:0.0539862361,NC_034402.1 Bruges virus:0.6194957047)100:0.2179091508,(((((NC_003468.2 Orthohantavirus andesense:0.3220309131,NC_034553.1 Maporal virus:0.3217768427)100:0.0967750566,(NC_034515.1 Orthohantavirus delgaditoense:0.3420020277,NC_038299.1 Orthohantavirus bayoui:0.3578938480)78:0.0604905717)100:0.0681808060,(NC_034403.1 Orthohantavirus montanoense:0.4057149461,NC_077671.1 Orthohantavirus sinnombreense:0.3295415521)96:0.0808310506)100:0.1515347499,NC_038695.1 Rockport virus:0.6146030315)75:0.0529123693,(((NC_005225.1 Orthohantavirus puumalaense:0.3178994625,(NC_034519.1 Orthohantavirus khabarovskense:0.2903408237,NC_055636.1 Orthohantavirus tatenalense:0.2951103060)96:0.0707049689)100:0.1162566928,NC_038939.1 Orthohantavirus prospectense:0.4860686808)100:0.0974816090,(NC_034467.1 Fugong virus:0.3408088379,NC_038529.1 Eothenomys miletus hantavirus 
LX309:0.3214413064)100:0.1876647016)100:0.0906674433)100:0.3112288106)100:0.2995543026)97:0.1136359007,NC_078485.1 Lena virus:1.2137610889)49:0.0697049196,(((MG663536.1 Dakrong virus:0.4927348232,NC_038515.1 Laibin virus:0.3837609395)94:0.0895431598,NC_078262.1 Xuan son virus:0.4767046102)100:0.2182159381,((NC_034401.1 Quezon virus:0.5482148765,NC_055632.1 Orthohantavirus robinaense:0.5333969980)100:0.2727779310,OR684449.1 Buritiense virus:0.6549294470)90:0.1135643862)55:0.0661132415)100:0.9075896851);
The modified treefile contains accession numbers and organism names, which is much more informative.
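The relabelling idea can be illustrated with a short regular-expression sketch. This is not the code behind `--update_tree`, which is not shown in this guide; `relabel_newick` is a hypothetical function that assumes leaf labels are accession versions immediately followed by a colon, as in the treefile above.

```python
import re


def relabel_newick(newick, names):
    """Append organism names to matching leaf labels in a Newick string."""
    if not names:
        return newick
    # Longest keys first, so a longer accession is tried before a shorter prefix
    keys = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w.])(" + "|".join(re.escape(k) for k in keys) + r")(?=:)"
    )
    # Keep the accession and add the organism name after a space
    return pattern.sub(lambda m: f"{m.group(1)} {names[m.group(1)]}", newick)
```

Labels that have no entry in `names`, as well as bootstrap values and branch lengths, are left untouched because only known accessions directly before a colon are matched.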
Input
! Phyloki --get_hosts -h
Output
usage: get_hosts.py [-h] -email EMAIL -i INPUT -o OUTPUT
Retrieve host information for a list of accession numbers.
options:
-h, --help show this help message and exit
-email, --email EMAIL
User email (required for NCBI API requests)
-i, --input INPUT TXT file with the list of accession numbers
-o, --output OUTPUT Output file to save host information
Input
! Phyloki --get_hosts -email ivpopov@donstu.ru -i sample_accession_numbers.txt -o accession_host.txt
Output
The request has been fulfilled.
File saved to accession_host.txt
Input
! head -5 accession_host.txt
Output
NC_034519.1 Microtus maximowiczii
NC_055636.1 Microtus agrestis
NC_005225.1 ND
NC_038939.1 Microtus pennsylvanicus
NC_038529.1 Eothenomys miletus
Input
! Phyloki --get_hosts_order -h
Output
usage: get_hosts_order.py [-h] -email EMAIL -i INPUT -o OUTPUT
Fetch taxonomic orders for host organisms based on accession numbers.
options:
-h, --help show this help message and exit
-email, --email EMAIL
User email (required for NCBI API requests)
-i, --input INPUT TXT file containing accession numbers and host species
-o, --output OUTPUT Output file to save accession numbers with their
hosts' taxonomic orders
Input
! Phyloki --get_hosts_order -email ivpopov@donstu.ru -i accession_host.txt -o accession_hosts_order.txt
Output
The request has been fulfilled.
File saved to accession_hosts_order.txt
Please do not forget to edit the file manually.
The query this function sends to the NCBI database is fairly complex.
Sometimes this function prints:
"Error - HTTP Error 400: Bad Request" in case of a bad connection, or
"Note - False record" when there is no record for the host organism.
Input
! head -5 demo_data/accession_order.txt
Output
NC_034519.1 Rodentia
NC_055636.1 Rodentia
NC_005225.1 ND
NC_038939.1 Rodentia
NC_038529.1 Rodentia
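To find the entries that still need manual editing, one option is to list every accession whose order is ND or absent. The sketch below assumes the two-column layout shown above; `find_unresolved` is a hypothetical helper and not part of Phyloki.

```python
def find_unresolved(lines, missing="ND"):
    """Return accessions whose taxonomic order is missing or marked ND."""
    unresolved = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        accession = parts[0]
        # A line with no second column is treated like an ND entry
        order = parts[1] if len(parts) > 1 else missing
        if order == missing:
            unresolved.append(accession)
    return unresolved
```

Running it on the five lines above would flag only NC_005225.1, which is the record with no host information in GenBank.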
Input
from phyloki import dataset4itol as d4i
Input
unique_orders = d4i.get_unique_orders("accession_order.txt")
print(unique_orders)
Output
['Rodentia', 'ND', 'Eulipotyphla', 'Chiroptera', 'Squamata']
Input
color_map = d4i.set_color_map("accession_order.txt")
print(color_map)
An interactive window will open and ask for a HEX code for each unique order.
Alternatively, the user can set color_map
manually.
Input
color_map = {'Rodentia': '#0ca20c', 'ND': '#ffffff', 'Eulipotyphla': '#0078ff', 'Chiroptera': '#000000', 'Squamata': '#ffa500'}
print(color_map)
Output
{'Rodentia': '#0ca20c', 'ND': '#ffffff', 'Eulipotyphla': '#0078ff', 'Chiroptera': '#000000', 'Squamata': '#ffa500'}
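When the color map is typed by hand, a misspelled order or a malformed HEX code is easy to miss. The following is a minimal check, assuming six-digit HEX codes like the ones used in this guide; `check_color_map` is a hypothetical helper, not a function of `dataset4itol`.

```python
import re

# Six-digit HEX color such as #1e90ff
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")


def check_color_map(color_map, orders):
    """Report orders without a color, colors without an order, and bad codes."""
    missing = [o for o in orders if o not in color_map]
    unused = [o for o in color_map if o not in orders]
    invalid = [o for o, c in color_map.items() if not HEX_COLOR.fullmatch(c)]
    return {"missing": missing, "unused": unused, "invalid": invalid}
```

Passing the manually written `color_map` together with `unique_orders` should return three empty lists when every order has exactly one valid color.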
Input
- input txt file with the list of accession numbers and organism names
- input txt file with the list of accession numbers and the taxonomic orders of the microorganisms' hosts
- output file
- manually created color map
d4i.get_itol_dataset("accession_organism.txt", "accession_order.txt", "dataset_for_iTOL.txt", color_map)
Output
Colors were set by the user.
The request has been fulfilled.
Input
! head -5 dataset_for_iTOL.txt
Output
DATASET_COLORSTRIP
SEPARATOR TAB
DATASET_LABEL Host Group Colors
DATA
NC_034519.1 Orthohantavirus khabarovskense #0ca20c Rodentia
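To make the file layout easier to follow, here is a sketch that assembles the same kind of color strip dataset from plain dictionaries. It is not the implementation of `d4i.get_itol_dataset`; `build_colorstrip` is a hypothetical function, and it assumes that each leaf ID is the accession plus the organism name (matching the labels in the annotated tree) and that fields are separated by tabs.

```python
def build_colorstrip(organisms, orders, color_map, label="Host Group Colors"):
    """Return the text of an iTOL DATASET_COLORSTRIP annotation file."""
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{label}",
        "DATA",
    ]
    for accession, organism in organisms.items():
        order = orders.get(accession, "ND")       # unknown host order
        color = color_map.get(order, "#ffffff")   # white if no color is set
        leaf = f"{accession} {organism}"          # must match the tree label
        lines.append("\t".join([leaf, color, order]))
    return "\n".join(lines) + "\n"
```

Each data row has three tab-separated fields: the leaf ID, the strip color, and the label shown for that color.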
- Visit iTOL
- Upload annotated_tree.treefile as the tree
- Upload dataset_for_iTOL.txt as the annotation dataset
Figure 3. Second tree, annotated with organism names and manually adjusted colors indicating the hosts' taxonomic order
This is the best tree, easily made with the Phyloki
software.
Let's take a look at the original tree again.
(Figure 1, the reference tree from the original paper, is shown here again.)
It can be seen that in the original version the authors annotated the tree manually and made some mistakes in the host annotation. The Phyloki
software did not make these mistakes.
Input
- input txt file with the list of accession numbers and organism names
- input txt file with the list of accession numbers and the taxonomic orders of the microorganisms' hosts
- output file
d4i.get_itol_dataset("accession_organism.txt", "demo_data/accession_order.txt", "demo_data/dataset_for_iTOL_2.txt")
Output
Colors were not set, they were generated randomly.
The request has been fulfilled.
Input
! head -5 demo_data/dataset_for_iTOL_2.txt
Output
DATASET_COLORSTRIP
SEPARATOR TAB
DATASET_LABEL Host Group Colors
DATA
NC_034519.1 Orthohantavirus khabarovskense #e31342 Rodentia
- Visit iTOL
- Upload annotated_tree.treefile as the tree
- Upload dataset_for_iTOL_2.txt as the annotation dataset
Figure 4. Third tree, annotated with organism names and randomly generated colors indicating the hosts' taxonomic order
In this case, random generation played a bad joke: almost every color looks the same. It is much more convenient to adjust the color map manually.
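One way to avoid near-identical random colors without picking every HEX code by hand is to space the hues evenly around the color wheel. The sketch below is an alternative suggestion, not how Phyloki generates colors; `distinct_colors` and its default saturation and brightness are assumptions chosen for illustration.

```python
import colorsys


def distinct_colors(n, saturation=0.65, value=0.9):
    """Return n HEX colors whose hues are evenly spaced on the color wheel."""
    colors = []
    for i in range(n):
        # Hue runs from 0 up to (but not including) 1 in equal steps
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        colors.append(
            "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return colors


# Pair the colors with the unique orders found earlier in this guide
orders = ["Rodentia", "ND", "Eulipotyphla", "Chiroptera", "Squamata"]
color_map = dict(zip(orders, distinct_colors(len(orders))))
```

The resulting `color_map` can be passed as the fourth argument of `d4i.get_itol_dataset`, in the same way as the manually created one.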