feat: implement are_equivalent functionality#1
Conversation
|
/gemini review |
- Introduced granular `IdentifierType` enum for better type safety. - Updated `get_symbol_accessions` and `get_seq` to use `IdentifierType`. - Implemented smart symbol expansion in `equivalence.rs` based on coordinate systems. - Ensured interchangeability between `IdentifierType` enums and strings in the Python bridge. - Fixed `rustc` deprecation warnings and improved Python stub generation.
|
/gemini review - There are significant changes. |
- Added `SequenceProxy` to `weaver/cli/provider.py` for recording and replaying genomic sequences. - Implemented assembly-aware manifest in sequence cache to handle multiple FASTA references. - Added `tests/test_hgvs_eval_integration.py` with 17 portably-runnable integration tests. - Created consolidated `tests/data/hgvs_eval_reference.gff` and `tests/data/hgvs_eval_sequences.json`. - Added `scripts/fetch_gff.py` utility for optimized GFF retrieval. - Updated Rust core and documentation for consistency.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a significant new feature: the are_equivalent function for comparing HGVS variants. This is a substantial change that touches many parts of the codebase, from the core Rust logic in hgvs-weaver to the Python bindings in src/lib.rs and the weaver package itself. The implementation correctly handles gene symbol expansion, cross-coordinate mapping, and variant normalization. New documentation and a comprehensive set of tests, including integration tests against hgvs-eval data, have been added to support this feature.
My review focuses on improving the robustness and maintainability of the new equivalence logic. Specifically, I've pointed out an incomplete implementation in a normalization function, some dead code, a potential performance regression, and opportunities to simplify and clarify complex coordinate-handling code.
…nts and improve robustness
…ew variant support information
… and unused imports
…epository clutter
- Fix identity variant (`c.=`) handling to prevent false frameshifts. - Implement surgical stop-codon suffix trimming to prevent massive deletion reports. - Resolve negative-strand delins translation mismatch by disabling inappropriate cDNA normalization in `c_to_p`. - Achieve 91.3% accuracy parity with `ref-hgvs`.
- Implement explicit nonsense (`p.Ter`) and extension (`ext*`) normalization - Fix `delins` formatting regression for single-residue substitutions - Add `AaEdit::Ext` formatting in `fmt.rs` - Implement `dna_repeat`, `pro_ext`, `pro_repeat`, `dna_con`, and `dna_copy` in parser - Update `sanity_parity_test.rs` expectations for improved accuracy - Add `normalization_test.rs` for nonsense and extension verification - Include category classification and extraction scripts
Modifies AltSeqBuilder to correctly handle Repeat variants by replacing the full repeat run instead of single base, fixing frameshift errors. Adds regression tests for repeat expansion and contraction. Also fixes logic in contingency.py (handling of Stop/Ter) and updates README with latest validation stats (98.6% match).
…vsError::TranscriptMismatch\n- Expose TranscriptMismatchError to Python\n- Update validation scripts to use structured errors\n- Fix unreachable pattern warnings in sequence.rs\n- Move SPDI logic to Rust and expose via PyO3
- Implement `Sequence` trait for iterator-based sequence access in Rust. - Refactor `AltSeqBuilder` to use lazy slicing and reduce eager string allocations. - Fix regression where synonymous variants (`c.360=`) were incorrectly translated as frameshifts. - Prioritize stop codons over frameshifts in protein translation if stop occurs at variant site. - Fix SPDI identity handling (`REF:REF` representation). - Integrate analysis and contingency reporting into `weaver-analyze`. - Add regression test suite `tests/test_regressions.rs`.
…d SPDI shifting. Update benchmarks to 100k variants.
…ts\n\n- Added `to_spdi` logic for `NaEdit::Repeat` in `structs.rs`\n- Implemented `c_to_g` in `DataProvider` trait for intronic offset resolution\n- Updated Python bridge and `RefSeqDataProvider` with `c_to_g` support\n- Fixed all test mocks and resolved `Unsupported operation` errors for these categories\n- Updated README.md with 100k ClinVar validation results (93.917% SPDI match)
…ngth check and insertion detection in `mapper.rs`
|
/gemini review This branch has added a large number of new features, and needs a thorough re-review. |
There was a problem hiding this comment.
Code Review
This is an impressive pull request that introduces significant new functionality and major improvements to the codebase. The implementation of the are_equivalent function is well-structured and handles complex cases like gene symbol expansion and cross-coordinate mapping. The refactoring of sequence handling using the Sequence trait is a standout improvement, promoting efficiency and code clarity. Moving the SPDI conversion logic into the Rust core is also a great decision.
I have one suggestion for improvement in an analysis script to ensure the reported validation metrics are as accurate as possible. Overall, this is excellent work.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
This PR adds the 'are_equivalent' function to compare HGVS variants, handling normalization, coordinate mapping, and gene symbols.