Support gRNA1 error tolerance #18

Merged
merged 8 commits into from
Jan 28, 2025

marissafujimoto
Collaborator

Description

Previously, the pgmap algorithm accepted only perfect gRNA1 matches. This change adds support for gRNA1 error tolerance of up to 2 substitutions. The algorithm precalculates and caches every mutated gRNA in the library. As such, its memory and time complexity is O(c * n^k), where c is the number of unique gRNAs, n is the length of the gRNAs, and k is the number of errors tolerated. With an error tolerance of 2, this takes around 20 seconds to precalculate the cache and uses <5 GB of memory. Error tolerances over 2 raise an error, but in principle could be supported on machines with enough memory.
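
To make the O(c * n^k) estimate concrete, here is a minimal sketch of enumerating substitution mutants with itertools. The function name `substitution_mutants` and the "ACGT" alphabet are assumptions for illustration and are not taken from the pgmap source.

```python
from itertools import combinations, product
from math import comb


def substitution_mutants(seq, k, alphabet="ACGT"):
    """Yield every sequence that differs from seq by exactly k substitutions.

    Hypothetical helper for illustration; not pgmap's actual function.
    """
    for positions in combinations(range(len(seq)), k):
        # At each chosen position, allow only bases that differ from the original,
        # so every yielded sequence is generated exactly once.
        choices = [[b for b in alphabet if b != seq[i]] for i in positions]
        for replacement in product(*choices):
            mutant = list(seq)
            for i, base in zip(positions, replacement):
                mutant[i] = base
            yield "".join(mutant)


def mutant_count(n, k, alphabet_size=4):
    """Number of sequences at exactly k substitutions from a length-n sequence."""
    return comb(n, k) * (alphabet_size - 1) ** k
```

Under these assumptions, a 20 nt gRNA has 1 + 60 + 1710 = 1771 cache entries at tolerance 2, and the exact count per gRNA, C(n, k) * 3^k, grows as n^k for fixed k, which is consistent with the complexity stated above.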

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Added unit tests. Run with pip install . && python3 -m tests

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Collaborator

@cansavvy cansavvy left a comment

This is great -- really well made! I only have a couple of follow-ups (which you may have already been planning to do in future PRs):

  1. Has itertools been added to the requirements.txt?
  2. Are you planning on making unit tests for this functionality? Or is that in the next PR?
  3. Can we add documentation that will help people understand when or when not to allow for errors in the guide RNAs? Like what kinds of evaluations or signs would one look for to know what to decide?


if gRNA_error_tolerance:
    for gRNA in gRNAs:
        # go from high subs to low subs to prefer better alignments
Collaborator

Can you explain what this comment means?

Collaborator Author

I think this comment is unnecessary because each mutated gRNA is unique and only generated once. Going from high substitutions to low substitutions means that, if the same sequence were ever generated more than once, we would end up preferring the mutation with fewer substitutions. It's confusing because I actually wrote the code to generate each mutant gRNA just once, but I thought I should keep the reverse iteration order because it's more robust to changes in how the perturbations are generated.
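
The effect of the reverse iteration order can be sketched as follows. This is a hedged illustration, not the code from this PR: `substitution_mutants` and `build_alignment_cache` are hypothetical names, and the sketch puts the substitution count in the outer loop (unlike the snippet above, which loops over gRNAs first) so that the preference also holds across different gRNAs.

```python
from itertools import combinations, product


def substitution_mutants(seq, k, alphabet="ACGT"):
    """Yield sequences with exactly k substitutions (hypothetical helper)."""
    for positions in combinations(range(len(seq)), k):
        choices = [[b for b in alphabet if b != seq[i]] for i in positions]
        for replacement in product(*choices):
            mutant = list(seq)
            for i, base in zip(positions, replacement):
                mutant[i] = base
            yield "".join(mutant)


def build_alignment_cache(gRNAs, tolerance):
    """Map each mutated sequence to (reference gRNA, substitution count).

    Assumption: later dict writes overwrite earlier ones, so iterating from
    the highest substitution count down to 0 leaves the entry with the
    fewest substitutions in place whenever a sequence is written twice.
    """
    cache = {}
    for k in range(tolerance, -1, -1):  # high subs first, exact matches last
        for gRNA in gRNAs:
            for mutant in substitution_mutants(gRNA, k):
                cache[mutant] = (gRNA, k)
    return cache


# Two guides that are one substitution apart, to force a collision.
cache = build_alignment_cache(["AAAA", "AAAT"], tolerance=1)
print(cache["AAAT"])  # the exact match is written last, so it wins
```

Ties at the same substitution count (for example a sequence one substitution away from two guides) are still resolved by write order in this sketch; how pgmap handles that case is not stated in this conversation.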

Collaborator Author

marissafujimoto commented Jan 27, 2025

@cansavvy

  1. Has itertools been added to the requirements.txt?

itertools is part of the Python standard library, so I don't think we need to touch requirements.txt?

  2. Are you planning on making unit tests for this functionality? Or is that in the next PR?

Unit tests are included here in tests/__main__.py. Let me know if you have thoughts about how to test this better. Right now I only test the generation of the alignment cache and that a greater error tolerance gives more counts than a lower one, but I could also add a test with specifically constructed data to really test the alignment tolerances.
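
A test on constructed data could look roughly like the sketch below. The function `count_gRNA1` is a hypothetical brute-force stand-in based on Hamming distance, not pgmap's counting API, and the sequences are made up so that each read sits at a known number of substitutions from the first guide.

```python
import unittest
from collections import Counter


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_gRNA1(reads, gRNAs, tolerance):
    """Hypothetical stand-in for the counting step: assign each read to its
    closest guide and count it only if it is within the tolerance."""
    counts = Counter()
    for read in reads:
        best = min(gRNAs, key=lambda g: hamming(read, g))
        if len(read) == len(best) and hamming(read, best) <= tolerance:
            counts[best] += 1
    return counts


class TestConstructedTolerance(unittest.TestCase):
    GRNAS = ["ACGTACGTAC", "TTTTGGGGCC"]
    READS = [
        "ACGTACGTAC",  # 0 substitutions from the first guide
        "ACGTACGTAA",  # 1 substitution
        "ACGTACGTTT",  # 2 substitutions
        "ACGTACGGTT",  # 3 substitutions, never counted at tolerance <= 2
    ]

    def test_counts_by_tolerance(self):
        for tolerance, expected in [(0, 1), (1, 2), (2, 3)]:
            counts = count_gRNA1(self.READS, self.GRNAS, tolerance)
            self.assertEqual(counts["ACGTACGTAC"], expected)
            self.assertEqual(counts["TTTTGGGGCC"], 0)
```

In a real test the stand-in would be replaced by the library's own counting function, so that the expected counts check the cache-based alignment rather than the brute-force reference.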

  3. Can we add documentation that will help people understand when or when not to allow for errors in the guide RNAs? Like what kinds of evaluations or signs would one look for to know what to decide?

Yes, this is an interesting question, though to be honest I'm not sure right now what guidance to provide, and in the literature the choice seems somewhat arbitrary. I hope to address this as part of general "paired guide CRISPR screening methods" once we have run pgmap on many datasets and analyzed how the different error tolerances affect genetic interaction (GI) scores. Previously we had discussed using FastQC to help inform this, but my recent literature search and investigation point to the idea that the majority of gRNA mismatches are actually caused by mutations prior to sequencing. As far as I know, there is no well-described effect on GI scores and no best-practice way of setting these error tolerances.

@marissafujimoto marissafujimoto marked this pull request as ready for review January 27, 2025 21:05
@marissafujimoto marissafujimoto merged commit 467b911 into main Jan 28, 2025
5 checks passed