Support gRNA1 error tolerance #18

marissafujimoto · 2025-01-22T21:13:37Z

Description

Previously, the pgmap algorithm accepted only perfect gRNA1 matches. This change adds support for gRNA1 error tolerance of up to 2 substitutions. The algorithm precalculates and caches every mutated gRNA in the library. As such, its memory and time complexity is O(c * n^k), where c is the number of unique gRNAs, n is the length of the gRNAs, and k is the number of errors tolerated. With an error tolerance of 2, precalculating the cache takes around 20 seconds and uses <5 GB of memory. Error tolerances over 2 raise an error, but in principle could be supported on machines with enough memory.
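To illustrate where the O(c * n^k) growth comes from, here is a minimal sketch of enumerating every sequence within a given number of substitutions of one gRNA. It is not the code in this PR: the names `mutants` and `expected_count` are hypothetical, and it assumes gRNAs contain only the bases A, C, G, and T.

```python
from itertools import combinations, product
from math import comb

ALPHABET = "ACGT"  # assumption: gRNAs contain only these four bases


def mutants(grna, max_subs):
    """Yield (sequence, num_subs) for every sequence within max_subs substitutions."""
    for k in range(max_subs + 1):
        # choose which k positions to substitute
        for positions in combinations(range(len(grna)), k):
            # at each chosen position, only the three non-original bases are allowed
            choices = [[b for b in ALPHABET if b != grna[p]] for p in positions]
            for replacement in product(*choices):
                seq = list(grna)
                for p, b in zip(positions, replacement):
                    seq[p] = b
                yield "".join(seq), k


def expected_count(n, max_subs):
    """Number of sequences within max_subs substitutions of a length-n gRNA."""
    return sum(comb(n, k) * 3**k for k in range(max_subs + 1))
```

Under these assumptions, each gRNA contributes `expected_count(n, k)` cache entries, which is dominated by the `comb(n, k) * 3**k` term. That term grows roughly like n^k for a fixed k, consistent with the complexity stated above.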

Type of change

New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Added unit tests. Run with pip install . && python3 -m tests

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

cansavvy

This is great -- really well made! I only have a couple follow ups (which you may have already been planning to do in future PRs)

Has itertools been added to the requirements.txt?
Are you planning on making unit tests for this functionality? Or is that in the next PR?
Can we add documentation that will help people understand when or when not to allow for errors in the guide RNAs? Like what kinds of evaluations or signs would one look for to know what to decide?

cansavvy · 2025-01-27T19:48:12Z

src/pgmap/alignment/grna_cached_aligner.py

+
+    if gRNA_error_tolerance:
+        for gRNA in gRNAs:
+            # go from high subs to low subs to prefer better alignments


Can you explain what this comment means?

I think this comment is unnecessary, because each mutated gRNA is unique and only generated once. Going from high substitutions to low substitutions means that, if the same sequence were generated more than once, we would end up preferring the mutation with fewer substitutions. It's confusing because I actually wrote the code to generate each mutant gRNA just once, but I thought I should keep this reverse iteration order because it's more robust to how the perturbations are generated.
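A small sketch of why the iteration order can matter, under assumptions: `naive_mutants` and `build_cache` are hypothetical names, and the generator here is deliberately naive (it may "substitute" a base with itself), unlike the code in this PR, which generates each mutant only once. With a dict-based cache, later writes overwrite earlier ones, so iterating from high to low substitution counts leaves the lowest count in place.

```python
from itertools import combinations, product


def naive_mutants(grna, k):
    """Deliberately naive: a base may be replaced by itself, so the same
    sequence can be produced at several values of k."""
    for positions in combinations(range(len(grna)), k):
        for replacement in product("ACGT", repeat=k):
            seq = list(grna)
            for p, b in zip(positions, replacement):
                seq[p] = b
            yield "".join(seq)


def build_cache(grna, max_subs, descending=True):
    """Map each generated sequence to the substitution count written last."""
    ks = range(max_subs, -1, -1) if descending else range(max_subs + 1)
    cache = {}
    for k in ks:
        for seq in naive_mutants(grna, k):
            cache[seq] = k  # later writes overwrite earlier ones
    return cache
```

With this naive generator, `build_cache("ACGT", 2)` records the unmodified gRNA with 0 substitutions, while `build_cache("ACGT", 2, descending=False)` would record it with 2.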

marissafujimoto · 2025-01-27T21:00:16Z

@cansavvy

Has itertools been added to the requirements.txt?

itertools is part of the Python standard library, so I don't think we need to touch this?

Are you planning on making unit tests for this functionality? Or is that in the next PR?

Unit tests are included here in tests/__main__.py. Let me know if you have thoughts about how to test this better. Right now I just tested the generation of the alignment cache and that the greater error tolerance gives more counts than a lower one, but I could also add a test with specifically constructed data to really test the alignment tolerances.
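One possible shape for such constructed data, as a rough sketch only: build reads with a known number of substitutions and check how many fall within each tolerance. The helpers `substitute`, `hamming`, and `count_within` are hypothetical, and `count_within` is a brute-force stand-in for the counting step, not pgmap's API.

```python
def substitute(grna, positions):
    """Return grna with the base at each given position swapped for a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}  # every base maps to a different one
    seq = list(grna)
    for p in positions:
        seq[p] = swap[seq[p]]
    return "".join(seq)


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def count_within(reads, grna, tolerance):
    """Brute-force reference: how many reads are within `tolerance` substitutions."""
    return sum(hamming(read, grna) <= tolerance for read in reads)


grna = "ACGTACGTACGTACGTACGT"
# one read each with exactly 0, 1, 2, and 3 substitutions
reads = [substitute(grna, range(d)) for d in range(4)]
counts = [count_within(reads, grna, t) for t in range(3)]
```

Here each extra unit of tolerance should admit exactly one more read, so the expected counts can be asserted exactly rather than only checking that a higher tolerance gives more counts.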

Can we add documentation that will help people understand when or when not to allow for errors in the guide RNAs? Like what kinds of evaluations or signs would one look for to know what to decide?

Yes, this is an interesting question, though right now I'm not sure what guidance to provide, to be honest, and in the literature the choice seems somewhat arbitrary. I hope to address this as part of general "paired guide CRISPR screening methods" once we have run pgmap on many datasets and analyzed the differences between the error tolerances and how they affect genetic interaction (GI) scores. Previously we had discussed using FASTQC to help inform this, but my recent literature search and investigation point to the idea that the majority of gRNA mismatches are actually caused by mutations prior to sequencing. As far as I know, this doesn't have a well-described effect on GI scores, and there is no best-practice way of setting these error tolerances.

marissafujimoto added 8 commits January 14, 2025 13:47

Add gRNA1 error tolerance

7094aa9

Add gRNA error tolerance error conditions

ebe1083

Use cached aligner for gRNA2

26017c4

Add gRNA cached aligner docs

383366e

Make cli error bound constraints more specific

bca28b0

Merge branch 'main' into gRNA1-error-tolerance

4d8ad90

Support gRNA error tolerance up to 2

0a6b19a

Add max error tolerance test

4767392

marissafujimoto requested a review from cansavvy January 22, 2025 21:13

cansavvy approved these changes Jan 27, 2025


marissafujimoto marked this pull request as ready for review January 27, 2025 21:05

marissafujimoto merged commit 467b911 into main Jan 28, 2025
5 checks passed
