Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add filtering #62

Merged
merged 6 commits into from
Sep 30, 2024
Merged

add filtering #62

merged 6 commits into from
Sep 30, 2024

Conversation

kweav
Copy link
Collaborator

@kweav kweav commented Aug 29, 2024

Description

This is the beginning of a series of stacked PRs that are addressing NAs getting propagated throughout the code from normalization through CRISPR score and genetic interaction score calculation. In investigating issue #55, I found that NAs were being introduced for the test dataset because there were 2 input data pgRNA IDs that didn't have corresponding data in the annotation. So once the left_join() in the annotation step was performed, NAs were present.

The goal for addressing this issue is to

  • by default: report which IDs don't have corresponding annotation data & then filter the data for those constructs out before proceeding with the analysis
    • Will report to console for a certain number or smaller of IDs (default is 20, but can be set up user using num_ids_wo_annot
    • Will write out to a file for larger than that certain number (still need to add this functionality -- currently a placeholder comment)
    • filtering/removing those IDs and proceeding with the analysis is controlled by the argument rm_ids_wo_annot which by default is TRUE
    • (still need to add descriptions of these parameters to the vignette)
  • Alternately, the user can choose to keep that data within the analysis (set rm_ids_wo_annot to FALSE) and then the analysis will proceed but ignore NAs within addition, subtraction, median, and mean calculations.
    • Need to add behavior to proceed
    • Need to add a function for robust addition/subtraction/median/mean calculations that ignore NAs
    • Need to incorporate that function throughout the CRISPR and GI score calculation functions
    • Want to add an argument that by default for CRISPR and GI score calculation functions says not to ignore NAs and the user has to switch it so that NAs are ignored
    • Need to add to documentation about these changes

The checkmarked items above are what this PR specifically takes on which is reporting to console which IDs are filtered, filters them, and moves on with the analysis.

Next steps

The uncheckmarked items above will be tackled in stacked PRs down the line.

Fixes #55

How Has This Been Tested?

I've tested this fix locally on my computer using devtools to load the gimap changes and ran through the test data from Sita without error messages.

Feedback requested

  • Do you like the planned course of action? Are there changes you would make?
  • Could you also verify that this code runs the test data without error for you?

Thanks!!

@kweav kweav requested a review from cansavvy August 29, 2024 20:57
@kweav
Copy link
Collaborator Author

kweav commented Sep 23, 2024

@cansavvy as discussed earlier ...
First additional commit uses readr::write_csv to write to a file if there are a greater than [user specified parameter] number of IDs that aren't found in the annotation.

Second additional commit was after running devtools::document() since I added a couple of arguments.

Still need to discuss next steps (including what to do if rm_ids_wo_annot is FALSE (not default). Right now it continues the analysis, but we'll run into the errors we've seen before unless all calculations are made more robust to ignore NA values rather than propagating them.

Still also should probably adjust the comment to tell the user to check if their input data IDs are miscoded/follow the pattern they expect?

@cansavvy
Copy link
Collaborator

@cansavvy as discussed earlier ... First additional commit uses readr::write_csv to write to a file if there are a greater than [user specified parameter] number of IDs that aren't found in the annotation.

Perfect! thanks for doing this! Sorry for my delay in responding.

Second additional commit was after running devtools::document() since I added a couple of arguments.

I believe/supposedly you shouldn't technically have to re-run this because the GitHub actions will make sure documents are up to date. But it certainly doesn't hurt anything to do so. I'm not 100% sure about this but can make it so if the GitHub Actions here don't already do this.

Still need to discuss next steps (including what to do if rm_ids_wo_annot is FALSE (not default). Right now it continues the analysis, but we'll run into the errors we've seen before unless all calculations are made more robust to ignore NA values rather than propagating them.

I'll work on this part.

Still also should probably adjust the comment to tell the user to check if their input data IDs are miscoded/follow the pattern they expect?

Yeah I can adjust that comment in the next PR.

@cansavvy
Copy link
Collaborator

Seeing if I can get these errors resolved and then I'll merge and start fresh on the next part of this:

2024-09-23 19:56:55 (26.6 MB/s) - ‘/private/var/folders/0g/hj_q_pzx65bbjnslxz9n0src0000gn/T/RtmpashkeG/Rinste79625282f1/gimap/extdata/CCLE_gene_cn.csv’ saved [840532069/840532069]
Quitting from lines 209-216 [unnamed-chunk-17] (getting_started.Rmd)
Error: Error: processing vignette 'getting_started.Rmd' failed with diagnostics:
is.data.frame(x) is not TRUE
--- failed re-building ‘getting_started.Rmd’
SUMMARY: processing the following file failed:
‘getting_started.Rmd’
Error: Error: Vignette re-building failed.
Execution halted

@cansavvy cansavvy merged commit 68c0f98 into main Sep 30, 2024
4 of 7 checks passed
@cansavvy cansavvy deleted the handle_lm_issue branch September 30, 2024 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error message while calculating CRISPR score on new data from Sita
2 participants