Zero Accessory Distances and Varied Core Distances in PopPUNK Fit Graph Analysis #289
Comments
The points at zero accessory distance and >0.2 core distance are likely errors in distance estimation. I believe this overall shape arises when multiple mutations at the same site become common (is that right, @nickjcroucher @samhorsfield96?). I would consider looking at your fits with
Finally, you could look at the Jaccard distances of those points with sketchlib directly.
Hi both, I believe this shape is due to a high proportion of k-mer mismatches even at the lowest values of k, resulting in distance calculation errors. I would suggest reducing the values of both |
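To make the mechanism described above concrete, here is a minimal Python sketch of how core and accessory distances can be derived from Jaccard similarities at several k-mer lengths, assuming the model log(J_k) = log(1 - a) + k * log(1 - pi). The function name `fit_core_accessory`, the k-mer lengths, and all Jaccard values are made up for illustration. The sketch clamps an ordinary least-squares fit rather than performing the constrained regression a real implementation would use, so treat it as an approximation of the idea, not as PopPUNK's or sketchlib's code.

```python
import math

def fit_core_accessory(kmer_lengths, jaccard, floor=1e-12):
    """Fit log(J_k) = log(1 - a) + k * log(1 - pi); return (core, accessory)."""
    ks = [float(k) for k in kmer_lengths]
    # Floor the similarities so that log(0) is never evaluated.
    log_j = [math.log(max(j, floor)) for j in jaccard]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_y = sum(log_j) / n
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, log_j))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    # Assumption: clamp to the valid range instead of a constrained fit.
    slope = min(slope, 0.0)
    intercept = min(intercept, 0.0)
    core = 1.0 - math.exp(slope)           # pi
    accessory = 1.0 - math.exp(intercept)  # a
    return core, accessory

ks = [13, 17, 21, 25, 29]  # hypothetical k-mer lengths

# Well-behaved pair: similarities generated from pi = 0.02, a = 0.10.
clean = [(1 - 0.10) * (1 - 0.02) ** k for k in ks]
clean_core, clean_acc = fit_core_accessory(ks, clean)

# Very divergent pair: almost no k-mers match, even at the smallest k.
sparse = [0.02, 0.001, 0.0, 0.0, 0.0]
sparse_core, sparse_acc = fit_core_accessory(ks, sparse)
print(clean_core, clean_acc, sparse_core, sparse_acc)
```

In the second case the fitted line is so steep that its intercept goes positive and is clamped, which gives an accessory distance of exactly zero alongside a large core distance. That is one plausible way a pair could land on the zero-accessory axis of the fit plot.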
Thank you for your replies! These are very useful for me to understand what might be going on.
I would try using
I think they are PDF files if I recall. Can you list the files in your output directory?
Use |
We recreated the database without those outgroups and there are still some zero distances. However, the taxon we are analyzing is still very diverse, so a good number of k-mers may still fail to match.
I thought this was only to be used at the database creation step. Can it also be used to exclude samples from the model fitting step?
Can you try with the same
Yes, this will prune them out of the database. If you then run the model fitting step on that database (the |
I have repeated it numerous times now, but unfortunately the PDF is nowhere to be found.
Unfortunately it's hard to provide more help with the plot files: we can't replicate the issue and I'm not sure what the problem might be. My only guess would be something to do with the matplotlib backend. Perhaps another computer/environment/install could shed light on things, or seeing if you get the same issue with the example dataset. You can also make these plots yourself if you use the jaccard output of
Going back to your original question, I would also try running the QC mode and setting a max pi/core threshold of 0.22.
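As a rough illustration of what a maximum core distance threshold does, the Python sketch below flags pairs whose core distance exceeds 0.22 and separately lists those that also have an accessory distance of roughly zero. The function `flag_suspect_pairs`, the sample names, and the distances are hypothetical. This is not PopPUNK's QC implementation, only a way to inspect a table of pairwise distances you have already exported yourself.

```python
def flag_suspect_pairs(distances, max_core=0.22, accessory_tol=1e-6):
    """Split out pairs over the core threshold, and those with ~zero accessory.

    `distances` is an iterable of (sample_a, sample_b, core, accessory).
    """
    over_core = []
    zero_accessory = []
    for name_a, name_b, core, accessory in distances:
        if core > max_core:
            over_core.append((name_a, name_b))
            # Subset matching the pattern in the question: high core, no accessory.
            if accessory <= accessory_tol:
                zero_accessory.append((name_a, name_b))
    return over_core, zero_accessory

# Hypothetical pairwise distances: (sample_a, sample_b, core, accessory).
example = [
    ("s1", "s2", 0.010, 0.05),
    ("s1", "out1", 0.450, 0.00),
    ("s2", "out1", 0.300, 0.40),
]
over_core, zero_accessory = flag_suspect_pairs(example)
print(over_core)
print(zero_accessory)
```

Samples that appear repeatedly in the flagged pairs are natural candidates for removal before refitting the model.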
Hello,
I'm currently analyzing a substantial dataset comprising 11,743 genomes of different Enterococcus species, including over 20 outgroup genomes from the same family and order. Because of the genetic diversity, it's possible that certain genes are unique to individual genomes.
In the PopPUNK fit graph (core vs. accessory distances), we're observing genes with zero accessory distances but core distances spanning from 0.23 to 0.6. This pattern raises questions on the interpretation of the analysis, as I would think that even genes present in a single genome would typically show some level of accessory distance.
I am fitting a dbscan model with the following command:
poppunk --fit-model dbscan --K 2 --ref-db <database> --output <database> --threads 16
There is no error message, and I am using PopPUNK v2.6.0.
Here is an example of the scatterplot we obtain for Accessory v. Core distances:
What would be your interpretation of those? Could there be a potential issue or a misinterpretation in how the analysis is handling or categorizing these genes, perhaps due to the combination of a large dataset with high genetic diversity?