Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis #289

RyanCFink · 2023-11-24T18:33:39Z

Hello,

I'm currently analyzing a substantial dataset comprising 11,743 genomes of different Enterococcus species, including over 20 outgroup genomes from the same family and order. because of the genetic diversity, it's possible that certain genes are unique to individual genomes.
In the PopPUNK fit graph (core vs. accessory distances), we're observing genes with zero accessory distances but core distances spanning from 0.23 to 0.6. This pattern raises questions on the interpretation of the analysis, as I would think that even genes present in a single genome would typically show some level of accessory distance.
I am using dbscan as a model using the following command:
poppunk --fit-model dbscan --K 2 --ref-db <database> --output <database> --threads 16
No error message and I am using poppunk v2.6.0
Here is an example of the scatterplot we obtain for Accessory v. Core distances:

What would be your interpretation of those? could there be a potential issue or a misinterpretation in how the analysis is handling or categorizing these genes? Maybe due to the combination of a large dataset with high genetic diversity?

The text was updated successfully, but these errors were encountered:

johnlees · 2023-11-28T12:08:46Z

The points at zero accessory distance and >0.2 core are likely errors in distance estimation. This overall shape I believe happens when multiple mutations at the same site become common (is that right @nickjcroucher @samhorsfield96?)

I would consider looking at your fits with --plot-fit and consider adjusting your k-mer range. Also consider running the QC options on your data
See:

Finally, you could look at the jaccard distances of those points with sketchlib directly

samhorsfield96 · 2023-11-29T12:13:37Z

Hi both, I believe this shape is due to a high proportion of k-mer mismatches even at the lowest values of k, resulting in distance calculation errors. I would suggest reducing the values of both --min-k (default=13) and --max-k (default=29), and using the --plot-fit function as John suggested to check that the relationship between k and the proportion of matching k-mers is linear.

RyanCFink · 2023-11-29T14:31:24Z

Thank you for your replies! These are very useful for me to understand what might be going on.
The hypothesis that the k-mer mismatches are the source of those weird shapes is supported by the fact that a set of about 20 genomes are phylogenetically quite distant from the rest (Family and Order level).
I got the info on the database we are using for the analyses and I am posting them at the end of this comment.
The k-mer sizes are spanning the defaults. Do you suggest decreasing them anyway?
I am having problems running the --plot-fit command. I have used different arguments, from 5 to 1,137 but I didn't get output or error messages, the database was correctly fit to a model. I have tried also with smaller databases and those also didn't produce any output. Is the extension of the graphic output .png?
We are in the process of re-creating the database without those outgroups. I thought there was a way to easily remove genomes from the database, so I've been looking for it, but I couldn't find it (only ways to add or update the database).

Sketch version:			ed862592a54de54d4b556a5e0530096c85f3a08d
Number of samples:		11373
K-mer sizes:			13,17,21,25,29
Sketch size:			9984
Contains random matches:	True
Codon phased seeds:		False

johnlees · 2023-12-05T17:23:19Z

The k-mer sizes are spanning the defaults. Do you suggest decreasing them anyway?

I would try using --k-step 3 --min-k 17 --max-k 41. But the plots would be useful

I am having problems running the --plot-fit command. I have used different arguments, from 5 to 1,137 but I didn't get output or error messages, the database was correctly fit to a model. I have tried also with smaller databases and those also didn't produce any output. Is the extension of the graphic output .png?

I think they are PDF files if I recall. Can you list the files in your output directory?

We are in the process of re-creating the database without those outgroups. I thought there was a way to easily remove genomes from the database, so I've been looking for it, but I couldn't find it (only ways to add or update the database).

Use poppunk --qc-db --ref-db example_db --remove-samples samples_to_remove.txt, see QC docs.

RyanCFink · 2023-12-14T13:55:58Z

I would try using --k-step 3 --min-k 17 --max-k 41. But the plots would be useful
We are trying the different k-mers steps and thresholds right now. But we're having the same problem with --plot-fit even with other datasets. The following is the list of files that we get when we use it:

We recreated the database without those outgroups and there are stil some zero distances. However the taxon we are analyzing still has a very large diversity and there might still be a good number of k-mers that don't match.

Use poppunk --qc-db --ref-db example_db --remove-samples samples_to_remove.txt, see [QC docs]

I thought this was only to be used at the database creation step, can it be used also to exclude samples from the model fitting step?

johnlees · 2023-12-14T14:16:20Z

But we're having the same problem with --plot-fit even with other datasets. The following is the list of files that we get when we use it

Can you try with the same --ref-db and --output. Or also have a look in poppunk_db/ as that's the folder I'd expect the plots to be in.

I thought this was only to be used at the database creation step, can it be used also to exclude samples from the model fitting step?

Yes, this will prune them out of the database. If you then run the model fitting step on that database (the --ref-db folder used) it will not use those samples

RyanCFink · 2024-01-10T19:31:53Z

I have repeated it numerous times now, but the pdf is nowhere to be found, unfortunately

johnlees · 2024-01-12T11:36:37Z

I have repeated it numerous times now, but the pdf is nowhere to be found, unfortunately

Unfortunately it's hard to provide more help with the plot files, we can't replicate and I'm not sure what the problem might be. My only guess would be something to do with the matplotlib backend. Perhaps another computer/environment/install could shed light on things, or seeing if you get the same issue with the example dataset.

You can also make these plots yourself if you use the jaccard output of sketchlib dist jaccard -- if you log the output this gives the y for the points plotted, the x is the k-mer length.

Going back to your original question, I would also try running the QC mode and setting a max pi/core threshold of 0.22.

johnlees added the question Further information is requested label Nov 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis #289

Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis #289

RyanCFink commented Nov 24, 2023 •

edited

Loading

johnlees commented Nov 28, 2023

samhorsfield96 commented Nov 29, 2023

RyanCFink commented Nov 29, 2023 •

edited

Loading

johnlees commented Dec 5, 2023

RyanCFink commented Dec 14, 2023 •

edited

Loading

johnlees commented Dec 14, 2023

RyanCFink commented Jan 10, 2024

johnlees commented Jan 12, 2024

Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis #289

Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis #289

Comments

RyanCFink commented Nov 24, 2023 • edited Loading

johnlees commented Nov 28, 2023

samhorsfield96 commented Nov 29, 2023

RyanCFink commented Nov 29, 2023 • edited Loading

johnlees commented Dec 5, 2023

RyanCFink commented Dec 14, 2023 • edited Loading

johnlees commented Dec 14, 2023

RyanCFink commented Jan 10, 2024

johnlees commented Jan 12, 2024

RyanCFink commented Nov 24, 2023 •

edited

Loading

RyanCFink commented Nov 29, 2023 •

edited

Loading

RyanCFink commented Dec 14, 2023 •

edited

Loading