-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: refactor csq analysis code #599
Conversation
WalkthroughThe pull request introduces significant enhancements to the Changes
Possibly related PRs
Poem
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
src/annotate/seqvars/csq.rs (2)
716-817
: Consider refactoring functions with many parametersThe functions
consequences_fix_special_cases
,analyze_intronic_variant
,analyze_cds_variant
, andanalyze_protein_variant
have numerous parameters, which can impact readability and maintainability. Grouping related parameters into structs or utilizing builder patterns can simplify function signatures.Also applies to: 819-932, 934-990, 992-1080
716-817
: Recommend adding unit tests for new analysis functionsGiven the complexity of genomic variant consequence prediction, adding unit tests for
analyze_intronic_variant
,analyze_cds_variant
,analyze_protein_variant
, andconsequences_fix_special_cases
would help ensure correctness and catch potential edge cases.Also applies to: 819-932, 934-990, 992-1080
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
Cargo.lock
is excluded by!**/*.lock
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(8 hunks)
🔇 Additional comments (3)
src/annotate/seqvars/csq.rs (3)
104-104
: Introduction of Consequences
type alias
Defining pub type Consequences = BitFlags<Consequence>;
improves code readability and reduces verbosity throughout the codebase.
466-479
: Refactored intronic variant analysis into a separate function
The introduction of the analyze_intronic_variant
function enhances modularity and makes the code more maintainable and testable by encapsulating intronic variant logic.
625-644
: Improved code organization with specialized analysis functions
Delegating CDS and protein variant analyses to analyze_cds_variant
and analyze_protein_variant
respectively enhances code clarity and separation of concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
src/annotate/seqvars/csq.rs (3)
716-817
: Consider simplifying 'consequences_fix_special_cases' for readabilityThe
consequences_fix_special_cases
function contains nested conditionals and complex logic that may impact readability. Refactoring the code or adding comments to explain intricate sections can enhance understanding and maintainability.
732-755
: Reduce code duplication in stop lost handlingThe handling of
StopLost
consequences from CDS and protein analyses has code duplication. Consider abstracting the common logic into a helper function to eliminate redundancy.Apply this diff to refactor the code:
if consequences_cds.contains(Consequence::StopLost) && !consequences_protein.contains(Consequence::StopLost) { *consequences &= !Consequence::StopLost; + self.handle_stop_lost_variant(consequences, var_p); }
And define the helper method:
fn handle_stop_lost_variant(&self, consequences: &mut Consequences, var_p: &HgvsVariant) { if let HgvsVariant::ProtVariant { loc_edit: ProtLocEdit::Ordinary { loc, edit }, .. } = var_p { let loc_length = Range::<i32>::from(loc.inner().clone()).len(); if let ProteinEdit::DelIns { alternative } = edit.inner() { match alternative.len().cmp(&loc_length) { Ordering::Equal => { *consequences |= Consequence::StopRetainedVariant; } Ordering::Greater => { *consequences |= Consequence::FeatureElongation; } _ => {} } } } }
1163-1183
: Clarify the logic in 'is_conservative_cds_variant'The conditions in
is_conservative_cds_variant
are complex. Adding comments or simplifying the logic can improve readability and help future maintainers understand the criteria for a conservative variant.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(8 hunks)
🔇 Additional comments (8)
src/annotate/seqvars/csq.rs (8)
104-105
: Introduce 'Consequences' type alias for clarity
Defining Consequences
as a type alias for BitFlags<Consequence>
enhances code readability and simplifies type annotations throughout the codebase.
380-380
: Correctly initialize 'consequences' with an empty set
Initializing consequences
using Consequences::empty()
is appropriate and ensures that no unintended flags are set before accumulation.
466-478
: Enhance modularity by extracting intronic variant analysis
The addition of analyze_intronic_variant
improves code modularity by encapsulating intronic variant logic. This makes the build_ann_field
method cleaner and more maintainable.
625-644
: Refactor CDS and protein variant analysis into separate methods
Separating the CDS and protein variant analyses into analyze_cds_variant
and analyze_protein_variant
enhances code organization and readability. This refactoring allows for easier maintenance and testing of individual components.
636-644
: Handle special cases with dedicated method
Introducing consequences_fix_special_cases
centralizes the logic for adjusting consequences based on specific conditions. This improves maintainability by isolating complex logic into a dedicated method.
819-932
: Verify boundary conditions in 'analyze_intronic_variant'
Double-check the boundary calculations for splice donor/acceptor sites, splice regions, polypyrimidine tracts, and donor regions to ensure accurate identification of intronic variants. This will help prevent off-by-one errors that could affect variant classification.
997-1002
: Ensure 'is_stop' function covers all stop codon representations
The is_stop
function checks for "X"
, "Ter"
, and "*"
. Verify that all possible representations of stop codons are included, such as "Stop"
or lowercased versions.
Consider updating the function:
fn is_stop(s: &str) -> bool {
- s == "X" || s == "Ter" || s == "*"
+ matches!(s.to_uppercase().as_str(), "X" | "TER" | "*" | "STOP")
}
904-932
: Potential off-by-one error in polypyrimidine tract range
The range for the polypyrimidine tract might have an off-by-one error. Confirm that the start and end positions correctly represent the intended nucleotide positions.
Run the following script to inspect the range calculations:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
src/annotate/seqvars/csq.rs (3)
716-820
: Consider enhancing error handling in special casesThe
consequences_fix_special_cases
method handles important edge cases for start/stop codons well. However, consider:
- Adding error handling for the sequence extraction at line 793
- Adding debug logging for special case detection
- Consider extracting the start/stop codon logic into separate methods for better maintainability
937-993
: Consider adding validation for CDS positionsThe CDS variant analysis is thorough but could benefit from additional validation:
- Add validation for CDS positions to ensure they're within valid ranges
- Consider adding debug assertions for invariants
1166-1186
: Add tests for conservative CDS variant detectionWhile test coverage is generally good, consider adding specific tests for:
- Edge cases in
is_conservative_cds_variant
- Various codon boundary conditions
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(8 hunks)
🔇 Additional comments (3)
src/annotate/seqvars/csq.rs (3)
104-104
: LGTM: Clear type alias definition
The type alias Consequences
for BitFlags<Consequence>
improves code readability and maintainability.
822-935
: LGTM: Well-structured intronic variant analysis
The intronic variant analysis is comprehensive and handles all relevant cases:
- Splice acceptor/donor sites
- Splice regions
- Polypyrimidine tract
995-1088
: Refactor stop codon detection
The protein variant analysis has a duplicate implementation of stop codon detection. As noted in the past review comments, consider unifying the implementation:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
src/annotate/seqvars/csq.rs (2)
821-934
: LGTM! Well-structured intronic variant analysis.The function thoroughly handles different types of splice site variants with clear comments explaining each case. The implementation correctly considers strand orientation and position-specific effects.
Consider extracting magic numbers into named constants for better maintainability:
+ const SPLICE_REGION_EXON_OFFSET: i32 = 3; + const SPLICE_REGION_INTRON_OFFSET: i32 = 8; + const POLYPYRIMIDINE_TRACT_OFFSET: i32 = 17;
716-819
: LGTM! Well-documented special case handling.The function handles edge cases for start/stop codon variants with clear examples. Good use of pattern matching and careful consideration of sequence context.
Consider adding logging for special case handling to aid in debugging and monitoring.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(9 hunks)
🔇 Additional comments (5)
src/annotate/seqvars/csq.rs (5)
104-104
: LGTM! Good type alias addition.
The Consequences
type alias improves code readability and maintainability by providing a clear, semantic name for the BitFlags<Consequence>
type.
1006-1008
: LGTM! Good implementation of has_stop
.
The has_stop
function was added based on past review feedback to improve consistency in stop codon detection. The implementation correctly checks for all stop codon indicators ('*', 'X', 'Ter').
936-992
: LGTM! Comprehensive CDS variant analysis.
The function correctly handles UTR regions and start/stop codon effects. Good use of pattern matching for variant analysis.
994-1088
: LGTM! Thorough protein variant analysis.
The function handles all protein variant types comprehensively with good documentation. Particularly good handling of frameshift, extension, and substitution cases.
1166-1186
: LGTM! Clear conservative CDS variant detection.
The function correctly determines if a CDS variant is conservative based on codon boundaries. Good documentation explaining the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
src/annotate/seqvars/csq.rs (3)
716-819
: Consider adding documentation for edge cases.The method handles special cases for start/stop codons, but the documentation could be improved by:
- Adding examples for each edge case
- Explaining why certain fixes are needed
- Documenting the relationship between CDS and protein consequences
936-992
: Consider extracting UTR-related logic.The CDS variant analysis mixes UTR handling with CDS analysis. Consider extracting UTR-related logic into a separate method for better maintainability.
Line range hint
1804-1953
: Refactor test comparison logic for better maintainability.The current implementation has several issues:
- Long chain of boolean conditions makes it hard to understand and maintain
- Special cases are mixed with general comparison logic
- Comments are scattered throughout the code
Consider:
- Creating an enum or struct for different tool comparisons
- Extracting special cases into a separate function
- Using a more declarative approach for comparisons
Example approach:
#[derive(Debug)] enum ComparisonTool { VEP, SnpEff, } struct ConsequenceComparator { tool: ComparisonTool, special_cases: HashMap<String, Vec<String>>, } impl ConsequenceComparator { fn new(tool: ComparisonTool) -> Self { let mut special_cases = HashMap::new(); // Initialize special cases based on tool Self { tool, special_cases } } fn is_equivalent(&self, expected: &[&str], actual: &[&str]) -> bool { // Implement comparison logic here } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(10 hunks)
🔇 Additional comments (3)
src/annotate/seqvars/csq.rs (3)
104-104
: LGTM! Good use of type alias for better code readability.
The type alias Consequences = BitFlags<Consequence>
improves code clarity and maintainability.
821-934
: LGTM! Well-structured intronic variant analysis.
The method thoroughly handles different types of splice variants with clear comments explaining each case.
994-1088
: Move helper functions to a separate module.
The is_stop
and has_stop
functions are utility functions that could be moved to a separate module for reuse.
Note: This suggestion aligns with a previous review comment about consistent stop codon detection.
fn overlaps(var_start: i32, var_end: i32, exon_intron_start: i32, exon_intron_end: i32) -> bool { | ||
(var_start < exon_intron_end) && (var_end > exon_intron_start) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add input validation to the overlaps function.
The function is critical for determining variant positions but lacks validation for invalid inputs (e.g., end < start).
Apply this diff to add validation:
fn overlaps(var_start: i32, var_end: i32, exon_intron_start: i32, exon_intron_end: i32) -> bool {
+ debug_assert!(var_start <= var_end, "variant start must not be after end");
+ debug_assert!(
+ exon_intron_start <= exon_intron_end,
+ "exon/intron start must not be after end"
+ );
(var_start < exon_intron_end) && (var_end > exon_intron_start)
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
fn overlaps(var_start: i32, var_end: i32, exon_intron_start: i32, exon_intron_end: i32) -> bool { | |
(var_start < exon_intron_end) && (var_end > exon_intron_start) | |
} | |
fn overlaps(var_start: i32, var_end: i32, exon_intron_start: i32, exon_intron_end: i32) -> bool { | |
debug_assert!(var_start <= var_end, "variant start must not be after end"); | |
debug_assert!( | |
exon_intron_start <= exon_intron_end, | |
"exon/intron start must not be after end" | |
); | |
(var_start < exon_intron_end) && (var_end > exon_intron_start) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
src/annotate/seqvars/csq.rs (3)
699-802
: Consider encapsulating parameters in a structThe method has 8 parameters which makes it hard to maintain and use correctly. Consider creating a dedicated struct to hold these parameters.
struct ConsequenceFixContext<'a> { consequences: &'a mut Consequences, consequences_cds: Consequences, consequences_protein: Consequences, var_g: &'a HgvsVariant, var_n: &'a HgvsVariant, var_c: &'a HgvsVariant, var_p: &'a HgvsVariant, } fn consequences_fix_special_cases(&self, ctx: ConsequenceFixContext) { // ... existing implementation using ctx... }
1019-1019
: Consider implementing the suggested move to hgvs-rs libraryThe TODO comment suggests moving the
is_stop
function to the hgvs-rs library. This would be a good improvement for code organization.Would you like me to help create a GitHub issue to track this improvement?
Line range hint
1822-1971
: Consider improving test case maintainabilityThe test validation logic contains many hardcoded tool-specific comparisons. Consider:
- Extracting tool-specific mappings into a configuration structure
- Converting magic numbers (like line 1837's ".vep") into named constants
struct ToolSpecificMappings { tool_name: &'static str, consequence_mappings: HashMap<&'static str, Vec<&'static str>>, special_cases: HashMap<&'static str, Vec<&'static str>>, } const VEP_TOOL: &str = ".vep"; const SNPEFF_TOOL: &str = ".snpeff";
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (2)
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_snv_brca1_one_variant@17-41197818-G-C.snap
is excluded by!**/*.snap
src/annotate/seqvars/snapshots/mehari__annotate__seqvars__csq__test__annotate_snv_brca1_one_variant@17-41197819-G-C.snap
is excluded by!**/*.snap
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(10 hunks)
🔇 Additional comments (3)
src/annotate/seqvars/csq.rs (3)
104-104
: LGTM! Good use of type alias
The type alias Consequences = BitFlags<Consequence>
improves code readability and maintainability by clearly indicating the purpose of the flags.
1012-1106
: LGTM! Well-structured variant analysis
The method provides comprehensive analysis of protein variants with clear pattern matching and good separation of concerns.
1206-1208
:
Add input validation for overlaps function
The overlaps function is critical for position checking but lacks validation. Consider adding assertions for the input ranges.
fn overlaps(var_start: i32, var_end: i32, exon_intron_start: i32, exon_intron_end: i32) -> bool {
+ debug_assert!(var_start <= var_end, "variant start must not be after end");
+ debug_assert!(
+ exon_intron_start <= exon_intron_end,
+ "exon/intron start must not be after end"
+ );
(var_start < exon_intron_end) && (var_end > exon_intron_start)
}
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
src/annotate/seqvars/csq.rs (3)
699-802
: Consider adding documentation for edge casesThe
consequences_fix_special_cases
method handles important edge cases for start/stop codons, but the documentation could be enhanced to better explain each case.Add documentation for each edge case, for example:
/// Fix special cases in consequence prediction. +/// +/// # Special Cases +/// +/// ## Stop Lost vs Stop Retained +/// When a stop lost is predicted based on CDS but not confirmed by protein translation, +/// we need to check if this is actually a stop retained variant. +/// Example: 20:35511609:CAAGCCGCCTCCAGGTAGCAGCCACAGCCAGGAGCACACAGACAGAAGACTGTGTCATGGGTCATGGCCCCTCCGCACACCTACAGGTTTGCCAAAGGAA:C +/// +/// ## Start Lost vs Start Retained +/// Similar to stop lost, we need to verify start lost predictions against protein translation. +/// Example: 13:32316456:TA:T
Line range hint
1850-2005
: Consider refactoring consequence comparison logicThe consequence comparison logic contains numerous special cases and workarounds for differences between tools (VEP and SnpEff). This makes the code harder to maintain and understand.
Consider:
- Extracting the comparison logic into a separate struct/module
- Creating a mapping between different tool's consequence terms
- Using a more declarative approach for consequence equivalence
Example structure:
struct ConsequenceMapper { /// Maps consequence terms between different tools term_mappings: HashMap<String, HashSet<String>>, } impl ConsequenceMapper { fn new() -> Self { let mut mapper = Self { term_mappings: HashMap::new(), }; // Define equivalences mapper.add_equivalent_terms( "disruptive_inframe_deletion", &["inframe_deletion", "protein_altering_variant"], ); // ... more mappings ... mapper } fn are_equivalent(&self, term1: &str, term2: &str) -> bool { // Check direct equality or mapped equivalence } }
1020-1026
: Add documentation for stop codon helper functionsThe
is_stop
andhas_stop
functions are well-implemented but could benefit from more detailed documentation.Add documentation explaining the difference between the functions:
+/// Checks if a string exactly matches a stop codon representation +/// +/// # Arguments +/// * `s` - The string to check +/// +/// # Returns +/// `true` if the string is exactly "*", "X", or "Ter" fn is_stop(s: &str) -> bool { s == "X" || s == "Ter" || s == "*" } +/// Checks if a string contains any stop codon representation +/// +/// # Arguments +/// * `s` - The string to check +/// +/// # Returns +/// `true` if the string contains "*", "X", or "Ter" fn has_stop(s: &str) -> bool { s.contains('*') || s.contains('X') || s.contains("Ter") }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(10 hunks)
🔇 Additional comments (3)
src/annotate/seqvars/csq.rs (3)
104-104
: LGTM: Clear type alias definition
The Consequences
type alias using BitFlags<Consequence>
provides a clear and type-safe way to handle multiple consequences.
804-864
: LGTM: Well-structured exonic variant analysis
The analyze_exonic_variant
method provides a clear and comprehensive analysis of exonic variants, including proper handling of splice regions and exon loss variants.
866-952
: LGTM: Thorough intronic variant analysis
The analyze_intronic_variant
method handles all relevant cases for intronic variants, including splice sites, regions, and polypyrimidine tracts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
src/annotate/seqvars/csq.rs (3)
4-4
: Remove unused importUncertainLengthChange
The import is not used in the code and can be safely removed.
-use hgvs::parser::{NoRef, ProteinEdit, UncertainLengthChange}; +use hgvs::parser::{NoRef, ProteinEdit};🧰 Tools
🪛 GitHub Check: clippy
[warning] 4-4: unused import:
UncertainLengthChange
warning: unused import:UncertainLengthChange
--> src/annotate/seqvars/csq.rs:4:40
|
4 | use hgvs::parser::{NoRef, ProteinEdit, UncertainLengthChange};
| ^^^^^^^^^^^^^^^^^^^^^
|
= note:#[warn(unused_imports)]
on by default
1034-1037
: Fix unused variables in frameshift handlingThe variables
alternative
,terminal
, andlength
in the frameshift case are unused. Either use them or mark them as ignored with_
.ProteinEdit::Fs { - alternative, - terminal, - length, + alternative: _, + terminal: _, + length: _, } => {🧰 Tools
🪛 GitHub Check: clippy
[warning] 1036-1036: unused variable:
length
warning: unused variable:length
--> src/annotate/seqvars/csq.rs:1036:29
|
1036 | ... length,
| ^^^^^^ help: try ignoring the field:length: _
[warning] 1035-1035: unused variable:
terminal
warning: unused variable:terminal
--> src/annotate/seqvars/csq.rs:1035:29
|
1035 | ... terminal,
| ^^^^^^^^ help: try ignoring the field:terminal: _
[warning] 1034-1034: unused variable:
alternative
warning: unused variable:alternative
--> src/annotate/seqvars/csq.rs:1034:29
|
1034 | ... alternative,
| ^^^^^^^^^^^ help: try ignoring the field:alternative: _
|
= note:#[warn(unused_variables)]
on by default
Line range hint
1850-1964
: Refactor consequence comparison logic for better readabilityThe consequence comparison logic contains multiple chained boolean expressions that could be refactored into a more maintainable structure. Consider:
- Creating a dedicated function for consequence comparison
- Using a lookup table or mapping for equivalent consequences
- Grouping related comparisons together
Example structure:
fn are_consequences_equivalent(expected: &[&str], actual: &[&str], tool: &str) -> bool { // Direct match if actual.iter().any(|csq| expected.contains(csq)) { return true; } // Tool-specific mappings match tool { "vep" => check_vep_equivalence(expected, actual), "snpeff" => check_snpeff_equivalence(expected, actual), _ => false, } } fn check_vep_equivalence(expected: &[&str], actual: &[&str]) -> bool { // VEP-specific mappings static VEP_MAPPINGS: &[(&str, &[&str])] = &[ ("disruptive_inframe_deletion", &["inframe_deletion"]), ("conservative_inframe_deletion", &["inframe_deletion"]), // ... other mappings ]; VEP_MAPPINGS.iter().any(|(exp, alts)| { expected.contains(exp) && actual.iter().any(|a| alts.contains(a)) }) }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(10 hunks)
🧰 Additional context used
🪛 GitHub Check: clippy
src/annotate/seqvars/csq.rs
[warning] 4-4: unused import: UncertainLengthChange
warning: unused import: UncertainLengthChange
--> src/annotate/seqvars/csq.rs:4:40
|
4 | use hgvs::parser::{NoRef, ProteinEdit, UncertainLengthChange};
| ^^^^^^^^^^^^^^^^^^^^^
|
= note: #[warn(unused_imports)]
on by default
[warning] 1036-1036: unused variable: length
warning: unused variable: length
--> src/annotate/seqvars/csq.rs:1036:29
|
1036 | ... length,
| ^^^^^^ help: try ignoring the field: length: _
[warning] 1035-1035: unused variable: terminal
warning: unused variable: terminal
--> src/annotate/seqvars/csq.rs:1035:29
|
1035 | ... terminal,
| ^^^^^^^^ help: try ignoring the field: terminal: _
[warning] 1034-1034: unused variable: alternative
warning: unused variable: alternative
--> src/annotate/seqvars/csq.rs:1034:29
|
1034 | ... alternative,
| ^^^^^^^^^^^ help: try ignoring the field: alternative: _
|
= note: #[warn(unused_variables)]
on by default
🔇 Additional comments (1)
src/annotate/seqvars/csq.rs (1)
1040-1062
: Review commented out frameshift resolution logic
There's a significant block of commented code that handles special cases where frameshifts resolve into missense + stop retained variants. This logic seems important but is currently disabled.
Would you like me to help implement this frameshift resolution logic or create an issue to track this enhancement?
✅ Verification successful
Commented code appears to be experimental frameshift resolution enhancement
The commented code block represents an experimental enhancement for handling special frameshift cases that resolve into missense + stop retained variants. Based on the codebase analysis:
- The basic frameshift variant detection is already implemented and working (
consequences |= Consequence::FrameshiftVariant
) - The commented code is an additional refinement that would detect edge cases where frameshifts resolve back into frame
- The code has multiple TODOs indicating it was not fully implemented:
- "TODO calculate proper change positions"
- "TODO get the protein sequence and compare"
- "FIXME remove StopLost or keep either way?"
Since this is clearly marked as incomplete experimental code with multiple TODOs and no associated tests, it's appropriate to have it commented out rather than partially implemented.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any related issues or discussions about this commented code
gh issue list --search "frameshift missense stop retained"
Length of output: 60
Script:
#!/bin/bash
# Search for any related code or tests handling frameshift variants
rg -A 5 "FrameshiftVariant" src/
# Search for any TODOs or FIXMEs related to frameshift handling
rg -i "todo.*frameshift|fixme.*frameshift" src/
# Look for any tests related to frameshift variants
fd "test.*\.rs" -x rg -l "frameshift"
Length of output: 2489
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
src/annotate/seqvars/csq.rs (2)
699-830
: Well-documented special case handling with clear examples.The
consequences_fix_special_cases
method effectively handles edge cases where CDS and protein predictions disagree. The code is well-documented with specific examples, making it easier to understand and maintain.Consider extracting some of the complex case handling into separate private methods to improve readability and maintainability. For example, the start lost and stop lost cases could be separate methods.
Line range hint
1850-2000
: Improve organization of test comparison logic.The test comparison logic handles many special cases for different variant effect predictors, but the code could be more maintainable. Consider:
- Extracting the comparison logic into a separate function or struct
- Using a more structured approach to define equivalence classes of effects
- Adding more detailed documentation explaining each special case
Example structure:
struct EffectComparator { vep_mappings: HashMap<String, Vec<String>>, snpeff_mappings: HashMap<String, Vec<String>>, } impl EffectComparator { fn new() -> Self { let mut vep_mappings = HashMap::new(); vep_mappings.insert( "disruptive_inframe_deletion".to_string(), vec!["inframe_deletion".to_string()] ); // ... more mappings ... Self { vep_mappings, snpeff_mappings: HashMap::new(), } } fn effects_match(&self, expected: &[String], actual: &[String], tool: &str) -> bool { // Implementation } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(10 hunks)
🔇 Additional comments (4)
src/annotate/seqvars/csq.rs (4)
832-892
: Clean and well-structured exonic variant analysis.
The analyze_exonic_variant
method has clear logic for handling different cases of exonic variants, including proper handling of splice regions. The code is well-organized and easy to follow.
894-980
: Comprehensive intronic variant analysis with clear documentation.
The analyze_intronic_variant
method thoroughly handles all cases of intronic variants, including splice sites and regions. Each case is well-documented with clear comments explaining the biological context.
982-1038
: Well-structured CDS variant analysis.
The analyze_cds_variant
method effectively handles UTR variants and coordinate-based analysis. The code is clean and well-documented.
1040-1134
: Thorough protein variant analysis with clear pattern matching.
The analyze_protein_variant
method effectively handles different types of protein edits through clear pattern matching. The code is well-structured and handles all cases comprehensively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
src/annotate/seqvars/csq.rs (1)
Line range hint
1852-2001
: Refactor test validation logic for better maintainability.The test validation logic contains complex conditional checks for comparing predictions from different tools. Consider extracting these checks into smaller, focused functions for better maintainability.
Example refactoring:
fn is_matching_prediction(record_csqs: &[&str], expected_one_of: &[&str], tool: &str) -> bool { let mut found_one = has_direct_match(record_csqs, expected_one_of); if tool.contains(".vep") { found_one |= is_vep_compatible_prediction(record_csqs, expected_one_of); } found_one |= is_frameshift_compatible(record_csqs, expected_one_of); found_one |= is_inframe_deletion_compatible(record_csqs, expected_one_of); // ... extract other checks into similar functions found_one } fn has_direct_match(record_csqs: &[&str], expected_one_of: &[&str]) -> bool { record_csqs.iter().any(|csq| expected_one_of.contains(csq)) } fn is_vep_compatible_prediction(record_csqs: &[&str], expected_one_of: &[&str]) -> bool { (record_csqs == ["coding_sequence_variant"] && !expected_one_of.is_empty()) || (record_csqs.contains(&"inframe_deletion") && (expected_one_of.contains(&"disruptive_inframe_deletion") || expected_one_of.contains(&"conservative_inframe_deletion"))) // ... other VEP-specific checks }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
src/annotate/seqvars/csq.rs
(10 hunks)
🔇 Additional comments (7)
src/annotate/seqvars/csq.rs (7)
104-104
: LGTM! Type alias improves code clarity.
The Consequences
type alias enhances code readability by providing a clear name for the BitFlags type.
834-894
: LGTM! Well-structured exonic variant analysis.
The function effectively handles:
- Exon loss variants
- Splice acceptor/donor variants
- Splice region variants
896-982
: LGTM! Comprehensive intronic variant analysis.
The function thoroughly handles:
- Splice acceptor/donor variants
- Splice region variants
- Polypyrimidine tract variants
- Splice donor region variants
984-1040
: LGTM! Thorough CDS variant analysis.
The function properly handles:
- Start/stop codon impacts
- UTR variants
- Input validation
699-832
: LGTM! Excellent special case handling with clear documentation.
The function effectively handles complex edge cases:
- Stop lost variants with protein translation verification
- Start lost variants with codon analysis
- Each case is well-documented with examples
1236-1238
:
Add input validation to prevent invalid ranges.
The overlaps
function is critical for determining variant positions but lacks validation for invalid inputs (e.g., end < start).
Apply this diff to add validation:
fn overlaps(var_start: i32, var_end: i32, exon_intron_start: i32, exon_intron_end: i32) -> bool {
+ debug_assert!(var_start <= var_end, "variant start must not be after end");
+ debug_assert!(
+ exon_intron_start <= exon_intron_end,
+ "exon/intron start must not be after end"
+ );
(var_start < exon_intron_end) && (var_end > exon_intron_start)
}
Likely invalid or redundant comment.
1042-1136
: 🛠️ Refactor suggestion
Consolidate duplicate is_stop
function implementations.
The is_stop
function is implemented twice with identical logic. Consider moving it to a common utility module or trait to maintain DRY principles.
Apply this diff to consolidate the implementations:
// Add to a common utility module or trait
+ pub(crate) fn is_stop(s: &str) -> bool {
+ s == "X" || s == "Ter" || s == "*"
+ }
// Remove duplicate implementations and use the common function
- fn is_stop(s: &str) -> bool {
- s == "X" || s == "Ter" || s == "*"
- }
Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (3)
src/annotate/seqvars/csq.rs (3)
699-832
: Refactorconsequences_fix_special_cases
to reduce complexity and improve readability.The function
consequences_fix_special_cases
has a high number of arguments and complex nested logic, which can affect maintainability and readability. Consider refactoring by:
- Grouping related parameters into a struct.
- Breaking down the function into smaller, self-contained helper functions.
- Simplifying nested match statements where possible.
835-894
: Enhance test coverage foranalyze_exonic_variant
.To ensure the reliability of the
analyze_exonic_variant
function, consider adding unit tests that cover various edge cases, such as:
- Variants overlapping exon boundaries.
- Different strand orientations.
- Edge cases for splice site variants.
This will help catch potential bugs and ensure future changes do not introduce regressions.
896-982
: Enhance test coverage foranalyze_intronic_variant
.Similar to
analyze_exonic_variant
, adding comprehensive unit tests foranalyze_intronic_variant
will improve confidence in its correctness. Test cases might include:
- Variants at intron-exon junctions.
- Variants affecting splice donor/acceptor sites.
- Different intron lengths and positions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
⛔ Files ignored due to path filters (1)
tests/data/annotate/db/grch38/GRCh38-ensembl.disagreement-subset.txs.bin.zst
is excluded by!**/*.zst
📒 Files selected for processing (4)
.gitattributes
(1 hunks)src/annotate/seqvars/csq.rs
(12 hunks)tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf
(1 hunks)tests/data/annotate/seqvars/vep.disagreement-cases.vcf
(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- tests/data/annotate/seqvars/vep.disagreement-cases.expected.vcf
- tests/data/annotate/seqvars/vep.disagreement-cases.vcf
🔇 Additional comments (1)
.gitattributes (1)
7-7
: LGTM! Configuration follows established pattern.
The new LFS configuration for the disagreement subset test data follows the existing pattern and is properly configured with appropriate filters.
Let's verify the referenced file exists and is tracked by Git LFS:
✅ Verification successful
Git LFS configuration is correctly applied and file exists
The file exists and the Git attributes are properly set with LFS configuration (filter=lfs, diff=lfs, merge=lfs) as specified in the .gitattributes file. The file size of 131 bytes suggests it's a placeholder file, which is expected for Git LFS tracked files in a sandbox environment.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the new test data file exists and is tracked by Git LFS
# Check if the file exists
if [ -f "tests/data/annotate/db/grch38/GRCh38-ensembl.disagreement-subset.txs.bin.zst" ]; then
echo "File exists"
# Verify Git LFS tracking
git lfs ls-files | grep "GRCh38-ensembl.disagreement-subset.txs.bin.zst"
else
echo "File not found"
fi
Length of output: 280
Script:
#!/bin/bash
# Let's verify if the file exists and check its attributes using different commands
# Check if the file exists and show its details
ls -l tests/data/annotate/db/grch38/GRCh38-ensembl.disagreement-subset.txs.bin.zst 2>/dev/null
# Check if the path exists using fd
fd "GRCh38-ensembl.disagreement-subset.txs.bin.zst" tests/
# Check if there are any Git attributes for this file
git check-attr -a tests/data/annotate/db/grch38/GRCh38-ensembl.disagreement-subset.txs.bin.zst 2>/dev/null
Length of output: 799
start_cds_from == CdsFrom::Start | ||
&& end_cds_from == CdsFrom::Start | ||
&& start_base % 3 == 1 | ||
&& (end_base + 1) % 3 == 1 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix modulo operation with negative numbers using rem_euclid
.
The use of the %
operator with potentially negative numbers (start_base
and end_base
) may lead to unexpected negative remainders, causing incorrect evaluations in the condition. To ensure correct and positive remainders, use the rem_euclid
method instead.
Apply this diff to fix the issue:
start_cds_from == CdsFrom::Start
&& end_cds_from == CdsFrom::Start
- && start_base % 3 == 1
- && (end_base + 1) % 3 == 1
+ && start_base.rem_euclid(3) == 1
+ && (end_base + 1).rem_euclid(3) == 1
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
start_cds_from == CdsFrom::Start | |
&& end_cds_from == CdsFrom::Start | |
&& start_base % 3 == 1 | |
&& (end_base + 1) % 3 == 1 | |
} | |
start_cds_from == CdsFrom::Start | |
&& end_cds_from == CdsFrom::Start | |
&& start_base.rem_euclid(3) == 1 | |
&& (end_base + 1).rem_euclid(3) == 1 | |
} |
Summary by CodeRabbit
New Features
Refactor