Ilia Popov edited this page Feb 19, 2025 · 6 revisions

Phyloki Usage Guide

Installation

pip install phyloki

Usage

Input

! Phyloki -h

Output

usage: Phyloki [--get_sequences] [--fetch_metadata] [--get_organisms]
               [--update_tree] [--get_hosts] [--get_hosts_order] [-V]

Phyloki: metadata fetcher for microbial phylogenetics.

options:
  --get_sequences    Run the module that retrieves nucleotide sequences from a
                     list of accession numbers.
  --fetch_metadata   Run the module to fetch metadata associated with
                     accession numbers.
  --get_organisms    Run the module that fetches organism names by their
                     corresponding accession versions.
  --update_tree      Run the module that updates a phylogenetic tree by
                     replacing accession numbers with accession numbers +
                     organism names.
  --get_hosts        Run the module that retrieves host information for given
                     accession numbers.
  --get_hosts_order  Run the module that fetches the taxonomic order for host
                     organisms.
  -V, --version      Display the version of Phyloki and exit.

Phyloki consists of 6 modules. Let's break them down!

To use Phyloki, the user must have an accession_numbers.txt file containing one accession number per line. It must look like this:

Input

! head -5 accession_numbers.txt

Output

NC_034519
NC_055636
NC_005225
NC_038939
NC_038529
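
Before running any module, it can help to sanity-check the list. The sketch below is not part of Phyloki: `read_accessions` is a hypothetical helper that assumes one accession per line, skips blank lines, and drops duplicates while keeping the original order.

```python
def read_accessions(lines):
    """Return unique, non-empty accession numbers in their original order."""
    seen = set()
    accessions = []
    for raw in lines:
        accession = raw.strip()
        if not accession or accession in seen:
            continue  # skip blank lines and repeated entries
        seen.add(accession)
        accessions.append(accession)
    return accessions


# Illustrative input: a blank line and one duplicate are mixed in on purpose.
sample = ["NC_034519\n", "NC_055636\n", "\n", "NC_005225\n", "NC_034519\n"]
print(read_accessions(sample))  # ['NC_034519', 'NC_055636', 'NC_005225']
```

With a real file, the same function can be given the open file handle, for example `read_accessions(open("accession_numbers.txt"))`.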

Example of using the get_sequences module

Input

! Phyloki --get_sequences -h

Output

usage: get_seqs.py [-h] -email EMAIL -i INPUT -o OUTPUT

Fetch nucleotide sequences from NCBI and save them as FASTA files.

options:
  -h, --help            show this help message and exit
  -email, --email EMAIL
                        User email (required for NCBI API requests)
  -i, --input INPUT     Path to TXT file containing accession numbers (one per
                        line)
  -o, --output OUTPUT   Directory to save downloaded FASTA files

Input

! Phyloki --get_sequences -email ivpopov@donstu.ru -i accession_numbers.txt -o sample_seq_directory

Output

Downloaded: NC_034519
Downloaded: NC_055636
Downloaded: NC_005225
Downloaded: NC_038939
Downloaded: NC_038529
Downloaded: NC_034467
Downloaded: NC_034553
Downloaded: NC_003468
Downloaded: NC_034515
Downloaded: NC_038299
Downloaded: NC_077671
Downloaded: NC_034403
Downloaded: NC_038695
Downloaded: NC_034556
Downloaded: LC553715
Downloaded: NC_005238
Downloaded: NC_005235
Downloaded: NC_034517
Downloaded: NC_006435
Downloaded: NC_005222
Downloaded: NC_055147
Downloaded: NC_034560
Downloaded: NC_034485
Downloaded: NC_034407
Downloaded: NC_034399
Downloaded: NC_034402
Downloaded: MG663536
Downloaded: NC_038515
Downloaded: NC_078262
Downloaded: NC_055632
Downloaded: NC_034401
Downloaded: OR684449
Downloaded: FJ593498
Downloaded: KX512433
Downloaded: NC_078485
Downloaded: KX779126
Downloaded: NC_034564
Downloaded: NC_010707
Downloaded: NC_055170
All downloads completed.
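
The email is required because NCBI asks E-utilities clients to identify themselves. As a rough illustration of the kind of request involved, the sketch below builds an `efetch` URL for a single accession. This guide does not show how Phyloki itself talks to NCBI, so `build_efetch_url` and the placeholder email are assumptions; the code only assembles a URL and makes no network request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for record retrieval.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, email):
    """Build an efetch URL that asks for one nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # accession number from the input list
        "rettype": "fasta",  # return the sequence in FASTA format
        "retmode": "text",
        "email": email,      # lets NCBI contact the user if needed
    }
    return f"{EFETCH}?{urlencode(params)}"


# Placeholder email, for illustration only.
url = build_efetch_url("NC_034519", "user@example.org")
print(url)
```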

What's next? Tree construction.

For demonstration purposes, I use the data from:

Barbosa dos Santos, M., Koide Albuquerque, N., Patroca da Silva, S. et al. A novel hantavirus identified in bats (Carollia perspicillata) in Brazil. Sci Rep 14, 6346 (2024). https://doi.org/10.1038/s41598-024-56808-6

The tree is precomputed and stored in the demo_data/ directory.
Upload demo_data/tree_ufb.treefile to iTOL for visualization.

Figure 1. Reference tree from the original paper

Figure 2. Naked phylogenetic tree

This tree is naked.
It lacks:

  1. Organism names. The tips carry only accession numbers, which tell the reader nothing.
  2. Host information. The tree shows the phylogenetic relationships between different viruses, but says nothing about the host organisms of those viruses.

It is worth mentioning that the two trees are identical in topology (bootstrap values are even higher in my version).

Bifurcation point

At this point the user reaches a bifurcation point:

  1. Fetch all metadata from NCBI GenBank and use it to annotate the tree in ggtree, or
  2. Create a dataset for use in iTOL.

The guide for fetching metadata comes first.

Example of downloading all the metadata from NCBI GenBank with the fetch_metadata module

Input

! Phyloki --fetch_metadata -h

Output

usage: fetch_metadata.py [-h] -email EMAIL -i INPUT -o OUTPUT

Fetch metadata for nucleotide sequences from NCBI.

options:
  -h, --help            show this help message and exit
  -email, --email EMAIL
                        User email (required for NCBI API requests)
  -i, --input INPUT     Path to TXT file containing accession numbers (one per
                        line)
  -o, --output OUTPUT   Path to output file (.tsv) to save retrieved metadata

Input

! Phyloki --fetch_metadata -email ivpopov@donstu.ru -i accession_numbers.txt -o sample_metadata.tsv

Output

Metadata retrieval complete.
File saved to sample_metadata.tsv

Input

import pandas as pd
df = pd.read_csv('sample_metadata.tsv', sep='\t')
df.head(5)

Output

	AN	AN_OrganismName	Country	Year	Host
0	NC_034519.1	NC_034519.1 Orthohantavirus khabarovskense	China	2011	Microtus maximowiczii
1	NC_055636.1	NC_055636.1 Orthohantavirus tatenalense	United Kingdom	2014	Microtus agrestis
2	NC_005225.1	NC_005225.1 Orthohantavirus puumalaense	ND	ND	ND
3	NC_038939.1	NC_038939.1 Orthohantavirus prospectense	USA	ND	Microtus pennsylvanicus
4	NC_038529.1	NC_038529.1 Eothenomys miletus hantavirus LX309	China	2009	Eothenomys miletus

This is a tab-separated values (TSV) table with the following columns:

  1. Accession number
  2. Accession number & organism name
  3. Country (/geo_loc_name) where the organism was isolated
  4. Year when the organism was isolated
  5. Host from which the organism was isolated

The user can use this metadata to annotate the tree in ggtree.
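
In the sample above, `ND` appears to mark fields that the GenBank record did not provide. A quick way to see how complete each column is can be sketched with the standard library, here on the five rows shown. `count_missing` is an illustrative helper, not a Phyloki function, and the `ND` convention is inferred from the sample output.

```python
import csv
import io

# The five rows printed above, written out as tab-separated text.
SAMPLE_TSV = (
    "AN\tAN_OrganismName\tCountry\tYear\tHost\n"
    "NC_034519.1\tNC_034519.1 Orthohantavirus khabarovskense\tChina\t2011\tMicrotus maximowiczii\n"
    "NC_055636.1\tNC_055636.1 Orthohantavirus tatenalense\tUnited Kingdom\t2014\tMicrotus agrestis\n"
    "NC_005225.1\tNC_005225.1 Orthohantavirus puumalaense\tND\tND\tND\n"
    "NC_038939.1\tNC_038939.1 Orthohantavirus prospectense\tUSA\tND\tMicrotus pennsylvanicus\n"
    "NC_038529.1\tNC_038529.1 Eothenomys miletus hantavirus LX309\tChina\t2009\tEothenomys miletus\n"
)


def count_missing(handle, marker="ND"):
    """Count, per column, how many rows hold the missing-value marker."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if value == marker:
                counts[name] += 1
    return counts


print(count_missing(io.StringIO(SAMPLE_TSV)))
# {'AN': 0, 'AN_OrganismName': 0, 'Country': 1, 'Year': 2, 'Host': 1}
```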

Example of adding organism names to the tree with the get_organisms module

Input

! Phyloki --get_organisms -h

Output

usage: get_organisms.py [-h] -email EMAIL -i INPUT -o OUTPUT

Fetch organism names and accession versions from NCBI.

options:
  -h, --help            show this help message and exit
  -email, --email EMAIL
                        User email (required for NCBI API requests)
  -i, --input INPUT     Path to TXT file containing accession numbers (one per
                        line)
  -o, --output OUTPUT   File path to save organism names and accession
                        versions

Input

! Phyloki --get_organisms -email ivpopov@donstu.ru -i sample_accession_numbers.txt -o accession_organism.txt

Output

Metadata retrieval complete.
File saved to accession_organism.txt

Input

! head -5 accession_organism.txt

Output

NC_034519.1 Orthohantavirus khabarovskense
NC_055636.1 Orthohantavirus tatenalense
NC_005225.1 Orthohantavirus puumalaense
NC_038939.1 Orthohantavirus prospectense
NC_038529.1 Eothenomys miletus hantavirus LX309

Now let's update the tree!

Input

! Phyloki --update_tree -h

Output

usage: update_tree.py [-h] -annotation ANNOTATION -tree TREE
                      -upd_tree UPD_TREE

Update a tree file with annotated organism names based on accession numbers.

options:
  -h, --help            show this help message and exit
  -annotation ANNOTATION
                        Path to the text file containing accession numbers and
                        organism names.
  -tree TREE            Path to the tree file that needs to be updated.
  -upd_tree UPD_TREE    Path to save the updated tree file.

Input

! Phyloki --update_tree -annotation accession_organism.txt -tree sample_tree.treefile -upd_tree annotated_tree.treefile

Output

The request has been fulfilled.
File saved to annotated_tree.treefile

Input

! head tree_ufb.treefile

Output

(FJ593498.1:0.1240225441,KX512433.1:0.1580233515,((((((KX779126.1:0.1801341369,NC_034564.1:0.1518834757)100:0.2690724126,NC_010707.1:0.4026159852)100:0.5357342048,NC_055170.1:3.2821188681)96:0.1993926731,(((((((LC553715.1:0.2424396410,NC_034556.1:0.2425091493)100:0.1540926638,NC_005238.1:0.2987153355)100:0.0815585291,((NC_005222.1:0.1745023294,NC_006435.1:0.1329576555)100:0.2619343862,(NC_005235.1:0.3091225291,NC_034517.1:0.3426538757)100:0.0890215926)59:0.0409665566)100:0.2464858691,NC_055147.1:0.5009202874)69:0.0579848758,((NC_034399.1:0.4574934455,NC_034407.1:0.4201827666)100:0.2133110745,(NC_034485.1:0.3554924125,NC_034560.1:0.3958031671)100:0.1134597575)100:0.0983251229)88:0.0539862361,NC_034402.1:0.6194957047)100:0.2179091508,(((((NC_003468.2:0.3220309131,NC_034553.1:0.3217768427)100:0.0967750566,(NC_034515.1:0.3420020277,NC_038299.1:0.3578938480)78:0.0604905717)100:0.0681808060,(NC_034403.1:0.4057149461,NC_077671.1:0.3295415521)96:0.0808310506)100:0.1515347499,NC_038695.1:0.6146030315)75:0.0529123693,(((NC_005225.1:0.3178994625,(NC_034519.1:0.2903408237,NC_055636.1:0.2951103060)96:0.0707049689)100:0.1162566928,NC_038939.1:0.4860686808)100:0.0974816090,(NC_034467.1:0.3408088379,NC_038529.1:0.3214413064)100:0.1876647016)100:0.0906674433)100:0.3112288106)100:0.2995543026)97:0.1136359007,NC_078485.1:1.2137610889)49:0.0697049196,(((MG663536.1:0.4927348232,NC_038515.1:0.3837609395)94:0.0895431598,NC_078262.1:0.4767046102)100:0.2182159381,((NC_034401.1:0.5482148765,NC_055632.1:0.5333969980)100:0.2727779310,OR684449.1:0.6549294470)90:0.1135643862)55:0.0661132415)100:0.9075896851);

A typical treefile contains only accession numbers, which tell the reader nothing about the organisms.

Input

! head annotated_tree.treefile

Output

(FJ593498.1 Nova virus:0.1240225441,KX512433.1 Nova virus:0.1580233515,((((((KX779126.1 Imjin virus:0.1801341369,NC_034564.1 Imjin virus:0.1518834757)100:0.2690724126,NC_010707.1 Thottapalayam virus:0.4026159852)100:0.5357342048,NC_055170.1 Hainan oriental leaf-toed gecko hantavirus:3.2821188681)96:0.1993926731,(((((((LC553715.1 Orthohantavirus thailandense:0.2424396410,NC_034556.1 Anjozorobe virus:0.2425091493)100:0.1540926638,NC_005238.1 Orthohantavirus seoulense:0.2987153355)100:0.0815585291,((NC_005222.1 Orthohantavirus hantanense:0.1745023294,NC_006435.1 Hantavirus Z10:0.1329576555)100:0.2619343862,(NC_005235.1 Orthohantavirus dobravaense:0.3091225291,NC_034517.1 Orthohantavirus sangassouense:0.3426538757)100:0.0890215926)59:0.0409665566)100:0.2464858691,NC_055147.1 Tigray virus:0.5009202874)69:0.0579848758,((NC_034399.1 Jeju virus:0.4574934455,NC_034407.1 Bowe virus:0.4201827666)100:0.2133110745,(NC_034485.1 Orthohantavirus caobangense:0.3554924125,NC_034560.1 Kenkeme virus:0.3958031671)100:0.1134597575)100:0.0983251229)88:0.0539862361,NC_034402.1 Bruges virus:0.6194957047)100:0.2179091508,(((((NC_003468.2 Orthohantavirus andesense:0.3220309131,NC_034553.1 Maporal virus:0.3217768427)100:0.0967750566,(NC_034515.1 Orthohantavirus delgaditoense:0.3420020277,NC_038299.1 Orthohantavirus bayoui:0.3578938480)78:0.0604905717)100:0.0681808060,(NC_034403.1 Orthohantavirus montanoense:0.4057149461,NC_077671.1 Orthohantavirus sinnombreense:0.3295415521)96:0.0808310506)100:0.1515347499,NC_038695.1 Rockport virus:0.6146030315)75:0.0529123693,(((NC_005225.1 Orthohantavirus puumalaense:0.3178994625,(NC_034519.1 Orthohantavirus khabarovskense:0.2903408237,NC_055636.1 Orthohantavirus tatenalense:0.2951103060)96:0.0707049689)100:0.1162566928,NC_038939.1 Orthohantavirus prospectense:0.4860686808)100:0.0974816090,(NC_034467.1 Fugong virus:0.3408088379,NC_038529.1 Eothenomys miletus hantavirus LX309:0.3214413064)100:0.1876647016)100:0.0906674433)100:0.3112288106)100:0.2995543026)97:0.1136359007,NC_078485.1 Lena virus:1.2137610889)49:0.0697049196,(((MG663536.1 Dakrong virus:0.4927348232,NC_038515.1 Laibin virus:0.3837609395)94:0.0895431598,NC_078262.1 Xuan son virus:0.4767046102)100:0.2182159381,((NC_034401.1 Quezon virus:0.5482148765,NC_055632.1 Orthohantavirus robinaense:0.5333969980)100:0.2727779310,OR684449.1 Buritiense virus:0.6549294470)90:0.1135643862)55:0.0661132415)100:0.9075896851);

The modified treefile contains both accession numbers and organism names, which makes the tree much easier to read.
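
To make the transformation concrete, here is a minimal sketch of how tip labels in a Newick string could be swapped for the annotated names. It is not Phyloki's own code: `parse_annotation` and `relabel` are hypothetical helpers, and the regular expression assumes unquoted leaf labels that directly follow `(` or `,` and are followed by a branch length, as in the trees shown above. The short three-tip tree is made up for the example.

```python
import re


def parse_annotation(lines):
    """Map each accession to its full 'accession organism name' line."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if line:
            accession = line.split(maxsplit=1)[0]
            mapping[accession] = line
    return mapping


# A leaf label follows '(' or ',' and precedes ':'. Support values follow ')'
# and are therefore left untouched.
LEAF = re.compile(r"(?<=[(,])([^():,;]+)(?=:)")


def relabel(newick, mapping):
    """Replace every known leaf label; unknown labels stay as they are."""
    return LEAF.sub(lambda m: mapping.get(m.group(1), m.group(1)), newick)


annotation = [
    "NC_034519.1 Orthohantavirus khabarovskense",
    "NC_055636.1 Orthohantavirus tatenalense",
    "NC_005225.1 Orthohantavirus puumalaense",
]
tree = "((NC_034519.1:0.29,NC_055636.1:0.30)96:0.07,NC_005225.1:0.32);"
print(relabel(tree, parse_annotation(annotation)))
```

Matching whole labels rather than doing a plain substring replacement avoids accidentally rewriting one accession that happens to be a prefix of another.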

Example of fetching host information with the get_hosts module

Input

! Phyloki --get_hosts -h

Output

usage: get_hosts.py [-h] -email EMAIL -i INPUT -o OUTPUT

Retrieve host information for a list of accession numbers.

options:
  -h, --help            show this help message and exit
  -email, --email EMAIL
                        User email (required for NCBI API requests)
  -i, --input INPUT     TXT file with the list of accession numbers
  -o, --output OUTPUT   Output file to save host information

Input

! Phyloki --get_hosts -email ivpopov@donstu.ru -i sample_accession_numbers.txt -o accession_host.txt

Output

The request has been fulfilled.
File saved to accession_host.txt

Input

! head -5 accession_host.txt

Output

NC_034519.1 Microtus maximowiczii
NC_055636.1 Microtus agrestis
NC_005225.1 ND
NC_038939.1 Microtus pennsylvanicus
NC_038529.1 Eothenomys miletus

Input

! Phyloki --get_hosts_order -h

Output

usage: get_hosts_order.py [-h] -email EMAIL -i INPUT -o OUTPUT

Fetch taxonomic orders for host organisms based on accession numbers.

options:
  -h, --help            show this help message and exit
  -email, --email EMAIL
                        User email (required for NCBI API requests)
  -i, --input INPUT     TXT file containing accession numbers and host species
  -o, --output OUTPUT   Output file to save accession numbers with their
                        hosts' taxonomic orders

Input

! Phyloki --get_hosts_order -email ivpopov@donstu.ru -i accession_host.txt -o accession_hosts_order.txt

Output

The request has been fulfilled.
File saved to accession_hosts_order.txt
Please do not forget to edit the file manually.
The query to NCBI database from this function is pretty difficult.
Sometimes this function prints:
"Error - HTTP Error 400: Bad Request" in case of bad connection or
"Note - False record" in case there is no record about the host organism.

Input

! head -5 demo_data/accession_order.txt

Output

NC_034519.1	Rodentia
NC_055636.1	Rodentia
NC_005225.1	ND
NC_038939.1	Rodentia
NC_038529.1	Rodentia
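
Because the module asks for the file to be edited manually, it helps to list the records that still need attention. A minimal sketch, assuming the file is tab-separated as in the sample and that `ND` marks a host order that could not be resolved (`needs_review` is a hypothetical helper, not part of Phyloki):

```python
def needs_review(lines, marker="ND"):
    """Return accessions whose host order is empty or equal to the marker."""
    flagged = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        accession, _, order = line.partition("\t")
        if order.strip() in ("", marker):
            flagged.append(accession)
    return flagged


# The five lines shown in the sample output above.
sample = [
    "NC_034519.1\tRodentia\n",
    "NC_055636.1\tRodentia\n",
    "NC_005225.1\tND\n",
    "NC_038939.1\tRodentia\n",
    "NC_038529.1\tRodentia\n",
]
print(needs_review(sample))  # ['NC_005225.1']
```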

API: Example of preparing info for iTOL

Input

from phyloki import dataset4itol as d4i

Input

unique_orders = d4i.get_unique_orders("accession_order.txt")
print(unique_orders)

Output

['Rodentia', 'ND', 'Eulipotyphla', 'Chiroptera', 'Squamata']

Input

color_map = d4i.set_color_map("accession_order.txt")
print(color_map)

An interactive window will open and ask for a HEX code for each unique order. Alternatively, the user can set color_map manually.

Input

color_map = {'Rodentia': '#0ca20c', 'ND': '#ffffff', 'Eulipotyphla': '#0078ff', 'Chiroptera': '#000000', 'Squamata': '#ffa500'}
print(color_map)

Output

{'Rodentia': '#0ca20c', 'ND': '#ffffff', 'Eulipotyphla': '#0078ff', 'Chiroptera': '#000000', 'Squamata': '#ffa500'}
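
Whichever way the map is produced, it is worth checking that every unique order has a well-formed HEX code before building the dataset. The sketch below is an illustration only: `check_color_map` is a hypothetical helper, and the two inputs are copied from the outputs printed above.

```python
import re

# Six hexadecimal digits after '#', e.g. '#0ca20c'.
HEX_COLOR = re.compile(r"^#[0-9a-fA-F]{6}$")


def check_color_map(orders, color_map):
    """Return (orders without a color, orders with a malformed HEX code)."""
    missing = [o for o in orders if o not in color_map]
    invalid = [
        o for o in orders
        if o in color_map and not HEX_COLOR.match(color_map[o])
    ]
    return missing, invalid


orders = ['Rodentia', 'ND', 'Eulipotyphla', 'Chiroptera', 'Squamata']
color_map = {'Rodentia': '#0ca20c', 'ND': '#ffffff', 'Eulipotyphla': '#0078ff',
             'Chiroptera': '#000000', 'Squamata': '#ffa500'}
print(check_color_map(orders, color_map))  # ([], [])
```

A non-empty first list means some host order has no color assigned; a non-empty second list points to a typo in a HEX code.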

API: Example of creating annotation dataset for iTOL

Using the manually adjusted color map

Input

# Arguments, in order:
#   1. input txt file with the list of accession numbers and organism names
#   2. input txt file with the list of accession numbers and the taxonomic order of each host
#   3. output file
#   4. manually created color map
d4i.get_itol_dataset("accession_organism.txt", "accession_order.txt", "dataset_for_iTOL.txt", color_map)

Output

Colors were set by the user.
The request has been fulfilled.

Input

! head -5 dataset_for_iTOL.txt

Output

DATASET_COLORSTRIP
SEPARATOR TAB
DATASET_LABEL	Host Group Colors
DATA
NC_034519.1 Orthohantavirus khabarovskense	#0ca20c	Rodentia
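
The layout of this file is simple enough to reproduce by hand. The sketch below rebuilds the five lines shown above from plain dictionaries. It mirrors only what the `head -5` output displays, so any further header fields Phyloki may write are not covered, and `colorstrip_lines` is a hypothetical helper rather than a Phyloki function.

```python
def colorstrip_lines(labels, orders, color_map, dataset_label="Host Group Colors"):
    """Build the lines of a tab-separated iTOL color strip dataset.

    labels:    accession -> tip label used in the annotated tree
    orders:    accession -> host taxonomic order
    color_map: host taxonomic order -> HEX color
    """
    lines = [
        "DATASET_COLORSTRIP",
        "SEPARATOR TAB",
        f"DATASET_LABEL\t{dataset_label}",
        "DATA",
    ]
    for accession, label in labels.items():
        order = orders[accession]
        # One row per tip: label, color of its host order, host order name.
        lines.append(f"{label}\t{color_map[order]}\t{order}")
    return lines


labels = {"NC_034519.1": "NC_034519.1 Orthohantavirus khabarovskense"}
orders = {"NC_034519.1": "Rodentia"}
colors = {"Rodentia": "#0ca20c"}
print("\n".join(colorstrip_lines(labels, orders, colors)))
```

The first column has to match the tip labels of the uploaded tree exactly, which is why the dataset uses the combined accession and organism name rather than the bare accession.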

Next steps

  1. Visit iTOL
  2. Upload annotated_tree.treefile file as the tree
  3. Upload dataset_for_iTOL.txt as the annotation dataset

Figure 3. Second tree, annotated with organism names and manually adjusted colors indicating the hosts' taxonomic order

This is the best tree that can easily be made with Phyloki.

Let's take a look at the original tree again

It can be seen that the authors of the original version annotated the tree manually and made some mistakes in the host annotation. Phyloki did not make these mistakes.

Using randomly generated color map

Input

# Arguments, in order:
#   1. input txt file with the list of accession numbers and organism names
#   2. input txt file with the list of accession numbers and the taxonomic order of each host
#   3. output file
d4i.get_itol_dataset("accession_organism.txt", "demo_data/accession_order.txt", "demo_data/dataset_for_iTOL_2.txt")

Output

Colors were not set, they were generated randomly.
The request has been fulfilled.

Input

! head -5 demo_data/dataset_for_iTOL_2.txt

Output

DATASET_COLORSTRIP
SEPARATOR TAB
DATASET_LABEL	Host Group Colors
DATA
NC_034519.1 Orthohantavirus khabarovskense	#e31342	Rodentia

Next steps

  1. Visit iTOL
  2. Upload annotated_tree.treefile file as the tree
  3. Upload dataset_for_iTOL_2.txt as the annotation dataset

Figure 4. Third tree, annotated with organism names and randomly generated colors indicating the hosts' taxonomic order

In this case random generation played a bad joke: almost every color looks the same. It is much more convenient to adjust the color map manually.
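
One way to avoid near-identical colors without picking each one by hand is to space the hues evenly around the color wheel. The sketch below uses only the standard library. `spaced_color_map` is a hypothetical helper, not part of Phyloki, and the saturation and brightness defaults are arbitrary choices.

```python
import colorsys


def spaced_color_map(orders, saturation=0.65, value=0.9):
    """Assign each order a HEX color with evenly spaced hues."""
    n = len(orders)
    color_map = {}
    for i, order in enumerate(orders):
        # Hue i/n walks around the color wheel in equal steps.
        r, g, b = colorsys.hsv_to_rgb(i / n, saturation, value)
        color_map[order] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return color_map


orders = ['Rodentia', 'ND', 'Eulipotyphla', 'Chiroptera', 'Squamata']
print(spaced_color_map(orders))
```

The resulting dictionary has the same shape as the manually written color_map above, so it could be passed as the fourth argument of `d4i.get_itol_dataset` in the same way.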