Support gRNA1 error tolerance #18
This is great -- really well made! I only have a couple follow ups (which you may have already been planning to do in future PRs)
- Has `itertools` been added to the requirements.txt?
- Are you planning on making unit tests for this functionality? Or is that in the next PR?
- Can we add documentation that will help people understand when or when not to allow for errors in the guide RNAs? Like what kinds of evaluations or signs would one look for to know what to decide?
    if gRNA_error_tolerance:
        for gRNA in gRNAs:
            # go from high subs to low subs to prefer better alignments
Can you explain what this comment means?
I think this comment is unnecessary, because each mutated gRNA is unique and generated only once. Going from high substitution counts to low ones means that, if the same mutant were ever produced at more than one count, we would end up preferring the one with fewer substitutions. It's confusing because I wrote the code to generate each mutant gRNA just once, but I thought I should keep the reverse iteration order because it's more robust to changes in how the perturbations are generated.
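To illustrate the robustness argument, here is a minimal sketch, not the pgmap code. It assumes a hypothetical, deliberately naive generator (`naive_mutants`) that may substitute a base with itself, so the same sequence can appear at several substitution counts. Because later dictionary writes overwrite earlier ones, iterating from high to low leaves each sequence tagged with its smallest count.

```python
from itertools import combinations, product

BASES = "ACGT"  # assumed alphabet for this sketch


def naive_mutants(seq, num_subs):
    """Hypothetical generator: picks num_subs positions and writes any base there,
    including the original one, so sequences with fewer real changes reappear."""
    for positions in combinations(range(len(seq)), num_subs):
        for replacement in product(BASES, repeat=num_subs):
            chars = list(seq)
            for i, base in zip(positions, replacement):
                chars[i] = base
            yield "".join(chars)


def substitution_counts(seq, max_subs, high_to_low=True):
    """Map each generated sequence to the substitution count that wrote it last."""
    order = range(max_subs, -1, -1) if high_to_low else range(max_subs + 1)
    counts = {}
    for num_subs in order:
        for mutant in naive_mutants(seq, num_subs):
            counts[mutant] = num_subs  # the last write wins
    return counts


reverse = substitution_counts("ACGT", 2, high_to_low=True)
forward = substitution_counts("ACGT", 2, high_to_low=False)
print(reverse["ACGA"], forward["ACGA"])
```

With a generator that never emits the same sequence twice for a given gRNA, both orders give the same result, which is why the ordering only acts as a safeguard.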
`itertools` is part of the Python standard library, so I don't think we need to touch requirements.txt.
Unit tests are included here in
Yes, this is an interesting question, though right now I'm honestly not sure what guidance to provide, and in the literature the choice seems somewhat arbitrary. I hope to address this as part of general "paired guide CRISPR screening methods" once we have run pgmap on many datasets and analyzed how different error tolerances affect genetic interaction (GI) scores. Previously we had discussed using FastQC to help inform this, but my recent literature search and investigation point to the idea that the majority of gRNA mismatches are actually caused by mutations prior to sequencing. As far as I know, this doesn't have a well-described effect on GI scores, and there is no best-practice way of setting these error tolerances.
Description
Previously, the pgmap algorithm accepted only perfect gRNA1 matches. This change adds support for gRNA1 error tolerance of up to 2 substitutions. The algorithm precalculates and caches every mutated gRNA in the library. As such, its memory and time complexity is O(c * n^k), where c is the number of unique gRNAs, n is the length of the gRNAs, and k is the number of errors tolerated. With an error tolerance of 2, precalculating the cache takes around 20 seconds and uses <5 GB of memory. Error tolerances over 2 result in an error, but in principle could be supported on machines with enough memory.
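The precalculation described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this PR: the names `mutants`, `build_cache`, and `cache_entries_per_gRNA` are hypothetical, the alphabet is assumed to be `ACGT`, and only substitutions are considered. The loop order follows the snippet under review (per gRNA, from high to low substitution counts).

```python
from itertools import combinations, product
from math import comb

BASES = "ACGT"  # assumed alphabet for this sketch


def mutants(gRNA, num_subs):
    """Yield every sequence differing from gRNA at exactly num_subs positions."""
    for positions in combinations(range(len(gRNA)), num_subs):
        # At each chosen position, only bases other than the original are allowed.
        choices = [[b for b in BASES if b != gRNA[i]] for i in positions]
        for replacement in product(*choices):
            chars = list(gRNA)
            for i, base in zip(positions, replacement):
                chars[i] = base
            yield "".join(chars)


def build_cache(gRNAs, max_subs):
    """Map every sequence within max_subs substitutions of a gRNA back to that gRNA."""
    cache = {}
    for gRNA in gRNAs:
        for num_subs in range(max_subs, -1, -1):
            for mutant in mutants(gRNA, num_subs):
                cache[mutant] = gRNA
    return cache


def cache_entries_per_gRNA(n, max_subs):
    """Count sequences within max_subs substitutions of one gRNA of length n."""
    return sum(comb(n, k) * 3**k for k in range(max_subs + 1))


cache = build_cache(["ACGT"], 1)
print(len(cache), cache_entries_per_gRNA(20, 2))
```

The per-gRNA count is the sum over k of C(n, k) * 3^k, which grows on the order of n^k and is consistent with the O(c * n^k) bound in the description. In this sketch, if two library gRNAs are close enough to share a mutant, whichever gRNA is processed last claims it; how pgmap resolves such collisions is not shown here.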
Type of change
How Has This Been Tested?
Added unit tests. Run with
pip install . && python3 -m tests
Checklist: