Taxon not being assigned correctly #333

Open
nermze opened this issue Oct 17, 2024 · 6 comments

Comments

@nermze

nermze commented Oct 17, 2024

Versions

conda env with python 3.8
poppunk v2.7.0
pp-sketchlib 2.1.3 (also tried 2.0.0)

Command used and output returned

poppunk_assign --db ../Haemophilus_influenzae_v2_refs --query infile.txt --output poppunk_clusters --threads 8

PopPUNK: assign
	(with backend: sketchlib v2.0.0
	 sketchlib: /home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences


Graph-tools OpenMP parallelisation enabled: with 8 threads
Sketching 228 genomes using 8 thread(s)
Progress (CPU): 228 / 228
Writing sketches to file
Loading previously refined model
Completed model loading
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Loading network from ../Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs.refs_graph.gt
Network loaded: 339 samples
Found novel query clusters. Calculating distances between them.
Calculating all query-query distances
Calculating random match chances using Monte Carlo
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%

Done

poppunk_downgraded_clusters.csv

Describe the bug
The query samples are not assigned to clusters 1-251; instead they are assigned to clusters 252 and up. See the attached file.
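
To put a number on this, the assignments in the output CSV can be tallied against the highest cluster ID expected from the reference database. The sketch below assumes the file has `Taxon` and `Cluster` columns with integer cluster IDs (an assumption about the output layout, not something confirmed in this thread) and takes 251 as the last existing cluster, as described above. `split_assignments` is a hypothetical helper name.

```python
import csv
import io


def split_assignments(csv_text, max_ref_cluster=251):
    """Split samples into those placed in existing clusters and novel ones.

    Assumes columns named 'Taxon' and 'Cluster' (an assumption about the
    output layout) and integer cluster IDs.
    """
    existing, novel = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cluster = int(row["Cluster"])
        # IDs above the last reference cluster are treated as novel.
        target = existing if cluster <= max_ref_cluster else novel
        target.append((row["Taxon"], cluster))
    return existing, novel


if __name__ == "__main__":
    # Made-up rows for illustration only; not data from this issue.
    example = "Taxon,Cluster\nsample_a,12\nsample_b,252\nsample_c,253\n"
    existing, novel = split_assignments(example)
    print(f"{len(existing)} in existing clusters, {len(novel)} in novel clusters")
```

If every query lands in the second list, that supports the report that none of the samples are joining the reference clusters.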

@johnlees
Member

Thanks for raising this.

  • What do you expect the clusters for these samples to be, and why?
  • Does --update-db make any difference?
  • Have you run any of the visualisation tools to see what this looks like on a tree? This could be a helpful diagnostic.

@nermze
Author

nermze commented Oct 18, 2024

Thanks for raising this.

  • What do you expect the clusters for these samples to be, and why?
  • Does --update-db make any difference?
  • Have you run any of the visualisation tools to see what this looks like on a tree? This could be a helpful diagnostic.

Hi, running poppunk_assign with --update-db causes the program to crash. It says the db file is not found in the folder, even though it is present. Maybe it's something I have misunderstood?

Poppunk_assign:

poppunk_assign --db Haemophilus_influenzae_v2_refs --query infile.txt --output poppunk_clusters --threads 8 
PopPUNK: assign
	(with backend: sketchlib v2.0.0
	 sketchlib: /home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences


Graph-tools OpenMP parallelisation enabled: with 8 threads
Sketching 228 genomes using 8 thread(s)
Progress (CPU): 228 / 228
Writing sketches to file
Loading previously refined model
Completed model loading
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Loading network from Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs.refs_graph.gt
Network loaded: 339 samples
Found novel query clusters. Calculating distances between them.
Calculating all query-query distances
Calculating random match chances using Monte Carlo
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%

Done

Now re-running the same command with --update-db (the error is the same no matter what db I specify):

poppunk_assign --db Haemophilus_influenzae_v2_refs --query infile.txt --output poppunk_clusters_db_update --threads 8 --update-db
PopPUNK: assign
	(with backend: sketchlib v2.0.0
	 sketchlib: /home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/pp_sketchlib.cpython-38-x86_64-linux-gnu.so)
Mode: Assigning clusters of query sequences


Graph-tools OpenMP parallelisation enabled: with 8 threads
Looking for existing sketches in poppunk_clusters_db_update/poppunk_clusters_db_update.h5
Loading previously refined model
Completed model loading
Calculating distances using 8 thread(s)
Progress (CPU): 100.0%
Loading network from Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs_graph.gt
Traceback (most recent call last):
  File "/home/bioinf/miniconda3/envs/poppunk/bin/poppunk_assign", line 11, in <module>
    sys.exit(main())
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/assign.py", line 211, in main
    assign_query(dbFuncs,
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/assign.py", line 307, in assign_query
    isolateClustering = assign_query_hdf5(dbFuncs,
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/assign.py", line 505, in assign_query_hdf5
    fetchNetwork(prev_clustering,
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/network.py", line 113, in fetchNetwork
    genomeNetwork = load_network_file(network_file, use_gpu = use_gpu)
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/PopPUNK/network.py", line 149, in load_network_file
    genomeNetwork = gt.load_graph(fn)
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/graph_tool/__init__.py", line 3666, in load_graph
    g.load(file_name, fmt, ignore_vp, ignore_ep, ignore_gp)
  File "/home/bioinf/miniconda3/envs/poppunk/lib/python3.8/site-packages/graph_tool/__init__.py", line 3165, in load
    with open(file_name) as f: # throw the appropriate exception
FileNotFoundError: [Errno 2] No such file or directory: 'Haemophilus_influenzae_v2_refs/Haemophilus_influenzae_v2_refs_graph.gt'

Poppunk_visualisation doesn't produce any results either.
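
The two logs above differ in the network file they name: the run without --update-db loads `Haemophilus_influenzae_v2_refs.refs_graph.gt`, while the run with --update-db looks for `Haemophilus_influenzae_v2_refs_graph.gt`. A quick check of which of those two files is actually present in the database directory could look like the sketch below. The helper name `graph_files_present` is hypothetical, and the two file-name patterns are taken only from the logs in this thread, not from the PopPUNK source.

```python
from pathlib import Path


def graph_files_present(db_dir):
    """Report which of the two network file names seen in the logs exist."""
    db = Path(db_dir)
    prefix = db.name  # the logs show files prefixed with the directory name
    candidates = {
        # name loaded in the run without --update-db
        "without_update_db": db / f"{prefix}.refs_graph.gt",
        # name requested in the run with --update-db
        "with_update_db": db / f"{prefix}_graph.gt",
    }
    return {label: path.is_file() for label, path in candidates.items()}


if __name__ == "__main__":
    print(graph_files_present("Haemophilus_influenzae_v2_refs"))
```

If only the first entry is reported as present, the file exists under a different name than the one the --update-db run asks for, which would match the traceback.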

@johnlees
Member

Ah right, sorry, --update-db won't work with the ref-only fit, only with the full database.

You mentioned in an email you think the v2 database might be the issue. Could you try with v1, which is available here: https://ftp.ebi.ac.uk/pub/databases/pp_dbs/Haemophilus_influenzae_v1_refs.tar.bz2
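
For anyone following along, fetching and unpacking the v1 archive can be scripted. This is a minimal sketch using only the Python standard library; the URL is the one given above, while the helper names `fetch` and `unpack` and the local file name are illustrative assumptions.

```python
import tarfile
import urllib.request
from pathlib import Path

V1_URL = (
    "https://ftp.ebi.ac.uk/pub/databases/pp_dbs/"
    "Haemophilus_influenzae_v1_refs.tar.bz2"
)


def fetch(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def unpack(archive, out_dir="."):
    """Extract a .tar.bz2 archive and return its sorted top-level entry names."""
    with tarfile.open(archive, "r:bz2") as tar:
        top_level = sorted(
            {Path(m.name).parts[0] for m in tar.getmembers() if Path(m.name).parts}
        )
        # Only extract archives from a source you trust.
        tar.extractall(out_dir)
    return top_level
```

Calling `unpack(fetch(V1_URL, "Haemophilus_influenzae_v1_refs.tar.bz2"))` downloads the archive once and returns the name of the extracted top-level directory, which can then be passed to `--db`.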

@nermze
Author

nermze commented Oct 18, 2024 via email

@johnlees
Member

Ok thanks, I'll look into this at some point soon. I assume using v1 solves your immediate issues?

@nermze
Author

nermze commented Oct 18, 2024

Ok thanks, I'll look into this at some point soon. I assume using v1 solves your immediate issues?

Yes, v1 works flawlessly for both the reference and the full database. We would of course like to use v2 for the final publication, but if there are not many changes between them then it's fine for now. Thank you for the help.
