
refactor/equivalence level enum #2

Merged
folded merged 6 commits into main from refactor/equivalence-level-enum
Feb 22, 2026

@folded folded commented Feb 21, 2026

  • feat: Protein variant projection improvements and v0.3.0 release
  • Move performance graph to external themed SVGs and use in README

This commit consolidates a series of structural and correctness improvements to the
protein variant projection engine and establishes the v0.3.0 baseline.

Core Library (hgvs-weaver):
- Refactored `EquivalenceLevel` enum for clarity (Identity vs. Analogous).
- Implemented support for Selenoproteins, allowing correct variant projection in
  transcripts with internal stop codons (e.g., GPX1).
- Added a "God-Mode" threshold for CDS end coordinates to filter 3' UTR noise:
  Variants after the primary stop codon are now correctly projected as p.(=).
- Corrected distal frameshift termination offsets to follow HGVS spec (+1 relative
  to the synthesized stop codon).
- Refactored coordinate mapping to be anchor-aware, resolving off-by-one errors
  at transcript boundaries.
- Improved repeat region handling by fetching full reference sequences for alignment.
- Enhanced out-of-bounds error reporting with explicit coordinate validation.
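
The selenoprotein and 3' UTR items above can be illustrated with a minimal Python sketch. This is not the hgvs-weaver implementation: `translate_cds`, `protein_change_for_substitution`, and the `selenoprotein` flag are hypothetical names, and the sketch assumes an uppercase ACGT sequence with 0-based, end-exclusive CDS coordinates.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(cds, selenoprotein=False):
    """Translate an annotated CDS (hypothetical helper, not the library API).

    An in-frame TGA before the final codon is read as selenocysteine (U)
    when `selenoprotein` is set; any other internal stop ends translation.
    """
    protein = []
    last_codon_start = len(cds) - 3
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        aa = CODON_TABLE[codon]
        if aa == "*" and i < last_codon_start:
            if selenoprotein and codon == "TGA":
                aa = "U"  # recoded internal stop
            else:
                protein.append("*")
                break
        protein.append(aa)
    return "".join(protein)


def protein_change_for_substitution(ref_tx, alt_tx, cds_start, cds_end):
    """Return "p.(=)" when a substitution leaves the translated CDS unchanged."""
    same = translate_cds(ref_tx[cds_start:cds_end]) == translate_cds(alt_tx[cds_start:cds_end])
    return "p.(=)" if same else "changed"
```

With the flag set, `translate_cds("ATGGCCTGAGGTTAA", selenoprotein=True)` reads through the internal TGA, and a substitution that falls after `cds_end` never reaches the translated slice, so it is reported as `p.(=)`.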

Performance & Visualization:
- Overhauled README.md performance dashboard with a new stacked bar chart format.
- Added success breakdown: Identity Match vs. Analogous (biologically equivalent) Match.
- Recalibrated historical benchmarking data (back to 0.1.0) to include the new
  granular success metrics for both Weaver and Ref-HGVS.
- Achieved a validated 98.77% total success rate on 100k ClinVar variants.
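
As a rough sketch of how the success breakdown above could be computed, the total success rate is the sum of the Identity and Analogous shares. The function name, the lowercase level labels, and the sample data are all made up for illustration; they are not the project's scripts or its benchmark numbers.

```python
from collections import Counter


def success_breakdown(levels):
    """Summarize per-variant equivalence labels as percentages.

    `levels` is a list of strings such as "identity", "analogous",
    "different", or "unknown" (hypothetical labels for this sketch).
    """
    if not levels:
        raise ValueError("no variants to summarize")
    counts = Counter(levels)
    total = len(levels)
    identity = 100.0 * counts["identity"] / total
    analogous = 100.0 * counts["analogous"] / total
    return {
        "identity_pct": round(identity, 2),
        "analogous_pct": round(analogous, 2),
        # Both kinds of match count as success in the stacked chart.
        "total_success_pct": round(identity + analogous, 2),
    }


# Toy input, not real validation data.
sample = ["identity"] * 6 + ["analogous", "different"]
summary = success_breakdown(sample)
```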

Tooling & Release:
- Bumped version to 0.3.0 across pyproject.toml and Cargo.toml.
- Updated analysis and visualization scripts for automated README generation.
- Archived full validation results for the 0.3.0 release in the benchmark directory.

Summary of Changes

Hello @folded, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the accuracy and robustness of protein variant projection and introduces a sophisticated system for classifying variant equivalence. By implementing granular equivalence levels and improving amino acid handling, the system can now better identify biologically identical variants despite notational differences. Additionally, the user experience is improved with theme-aware performance visualizations and clearer documentation of the new equivalence logic.

Highlights

  • Enhanced Variant Equivalence Logic: Introduced a granular EquivalenceLevel enum (Identity, Analogous, Different, Unknown) to precisely categorize variant comparisons. This new system leverages SparseReference and ProjectedSequence for robust protein and nucleotide sequence comparison, enabling the detection of biologically analogous variants even with different HGVS notations.
  • Improved Protein Variant Projection: Refactored the protein projection logic to correctly handle 5' UTR variants, ensure proper handling of premature stop codons in frameshifts, and accurately map insertions to duplications. This includes a 'GOD-MODE FIX' for differences occurring after official stop codons, significantly improving accuracy.
  • Theme-Aware Performance Graphs: Updated the README.md to display performance benchmark results using theme-aware SVG images. This allows the graphs to adapt to light and dark modes, improving readability and user experience.
  • Python Bindings and CLI Updates: Exposed the new EquivalenceLevel enum and equivalent_level method to Python, enhancing the Python API. The validation and analysis CLI tools were updated to utilize the new equivalence levels and generate the theme-aware performance graphs.
  • Robust Amino Acid Handling: Implemented new utility functions (Residue enum, aa3_to_aa1, normalize_aa, decompose_aa) for more consistent and accurate processing of 1-letter and 3-letter amino acid codes throughout the system, particularly in formatting and comparison.
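
To make the four equivalence levels concrete, here is a minimal Python sketch of the general idea: two edits are compared by applying each to the same reference and comparing the resulting sequences. The tuple-based edit format, `apply_edit`, and `equivalence_level` are assumptions made for illustration; the actual library works through `SparseReference` and `ProjectedSequence`, and its Python member names may differ.

```python
from enum import Enum


class EquivalenceLevel(Enum):
    IDENTITY = "identity"    # same notation
    ANALOGOUS = "analogous"  # different notation, same resulting sequence
    DIFFERENT = "different"  # different resulting sequence
    UNKNOWN = "unknown"      # could not be evaluated


def apply_edit(ref, edit):
    """Apply a toy edit to `ref` using 1-based positions (hypothetical format)."""
    kind = edit[0]
    if kind == "ins":  # ("ins", left_pos, seq): insert seq after left_pos
        _, left, seq = edit
        return ref[:left] + seq + ref[left:]
    if kind == "dup":  # ("dup", start, end): duplicate start..end inclusive
        _, start, end = edit
        return ref[:end] + ref[start - 1:end] + ref[end:]
    raise ValueError(f"unsupported edit type: {kind}")


def equivalence_level(ref, edit_a, edit_b):
    if edit_a == edit_b:
        return EquivalenceLevel.IDENTITY
    try:
        same = apply_edit(ref, edit_a) == apply_edit(ref, edit_b)
    except ValueError:
        return EquivalenceLevel.UNKNOWN
    return EquivalenceLevel.ANALOGOUS if same else EquivalenceLevel.DIFFERENT
```

On the reference `"ACGTAC"`, inserting `T` after position 4 and duplicating position 4 both yield `"ACGTTAC"`, so the pair is classified as analogous rather than identical.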


Changelog
  • Cargo.toml
    • Updated project version to 0.3.0.
    • Modified pyo3 features to explicitly define extension-module.
  • README.md
    • Replaced embedded SVG performance graph with a theme-aware <picture> element linking to external SVG files.
    • Updated validation results table to reflect new equivalence metrics (Identity Match, Analogous Match, Total Success).
    • Added a troubleshooting section for stub_gen linking errors.
  • benchmark_results/history.json
    • Updated benchmark data structure to include w_identity, w_analogous, ref_identity, and ref_analogous metrics.
    • Added a new benchmark entry for version 0.3.0.
  • benchmark_results/performance_dark.svg
    • Added new SVG file for the dark mode performance graph.
  • benchmark_results/performance_light.svg
    • Added new SVG file for the light mode performance graph.
  • docs/source/equivalence_logic.md
    • Rewrote the equivalence logic documentation to detail the new granular equivalence levels and comparison strategies.
    • Provided examples for Identity, Analogous, Different, and Unknown equivalence levels.
  • hgvs-weaver/Cargo.toml
    • Updated project version to 0.3.0.
  • hgvs-weaver/src/altseq.rs
    • Modified AltTranscriptData to include variant_start_idx and variant_end_idx.
    • Adjusted protein position calculation to correctly handle 5' UTR variants.
    • Removed redundant premature stop codon check for frameshifts.
    • Added bounds checking for transcript coordinates.
  • hgvs-weaver/src/altseq_to_hgvsp.rs
    • Modified AltSeqToHgvsp to include ref_cds_start_idx and ref_cds_end_idx for more accurate protein projection.
    • Refactored build_identity_variant for clarity and correctness.
    • Implemented a 'GOD-MODE FIX' to ensure differences after the official stop codon do not incorrectly affect protein variants.
    • Refined nonsense variant detection logic.
  • hgvs-weaver/src/analogous_edit.rs
    • Added new module for analogous edit detection.
    • Introduced ResidueToken enum for representing amino acid states (Known, Unknown, Any, Wildcard).
    • Defined SparseReference for managing sparse protein sequence data.
    • Implemented ProjectedSequence for representing and comparing projected variant outcomes.
    • Provided functions (apply_aa_edit_to_sparse, apply_na_edit_to_sparse, project_aa_variant, project_na_variant) for applying edits to sparse references and projecting variants.
    • Developed reconcile_projections and UnificationEnv for unifying and comparing projected sequences to determine analogous equivalence.
  • hgvs-weaver/src/edits.rs
    • Added is_identity method to AaEdit enum for easier identification of identity variants.
  • hgvs-weaver/src/equivalence.rs
    • Introduced EquivalenceLevel enum to categorize variant comparison outcomes.
    • Refactored are_equivalent to use equivalent_level for granular results.
    • Implemented equivalent_level_single to perform detailed comparisons using SparseReference and projection logic for protein and nucleotide variants.
    • Added helper functions is_cross_type_identity and get_effective_end for improved cross-type comparisons and protein variant end position calculation.
    • Integrated SparseReference and projection logic for p_vs_p_equivalent comparisons.
  • hgvs-weaver/src/fmt.rs
    • Improved formatting for AAPosition to consistently use 3-letter amino acid codes.
    • Enhanced AaEdit formatting for Fs, Ext, and Repeat edits to correctly handle 1-letter and 3-letter amino acid codes and Ter representation.
  • hgvs-weaver/src/lib.rs
    • Added analogous_edit module to the library.
  • hgvs-weaver/src/mapper.rs
    • Updated c_to_p method to pass cds_start_idx and cds_end_idx to AltSeqToHgvsp for more precise protein projection.
    • Adjusted normalize_variant logic for genomic insertions to correctly set start_idx.
  • hgvs-weaver/src/sequence.rs
    • Removed premature stop codon check from TranslateIterator to allow full translation of CDS.
  • hgvs-weaver/src/structs.rs
    • Corrected spdi_interval calculation for BaseOffsetInterval to accurately handle single-base insertions.
  • hgvs-weaver/src/utils.rs
    • Introduced Residue enum for representing amino acids.
    • Added aa3_to_aa1 and normalize_aa functions for consistent amino acid code conversion.
    • Implemented decompose_aa for robustly parsing protein sequences into Residue vectors, supporting both 1-letter and 3-letter codes.
  • hgvs-weaver/tests/analogous_test.rs
    • Added new test file for analogous_edit module.
    • Included tests for ResidueToken unification, analogous duplication shifting, protein repeat shifting, complex unification aliases, and ClinVar regression cases.
  • hgvs-weaver/tests/bug_repro_test.rs
    • Removed test file as issues were resolved or refactored.
  • hgvs-weaver/tests/decompose_test.rs
    • Added new test file for decompose_aa utility, covering strict 1-letter and 3-letter code parsing.
  • hgvs-weaver/tests/equivalence_test.rs
    • Added new test file for EquivalenceLevel functionality, including tests for identity, analogous, and parity matches.
  • hgvs-weaver/tests/mapping_test.rs
    • Updated test_mapper_c_to_g_3utr to use Option<&str> for reference_ac.
  • hgvs-weaver/tests/normalization_test.rs
    • Updated test_extension_normalization to reflect new protein formatting for extension variants.
  • hgvs-weaver/tests/regression_test.rs
    • Added new test file for regression cases, including c.35_36insT vs c.35dup and BRAF identity comparisons.
  • hgvs-weaver/tests/spec_summary_test.rs
    • Updated test_spec_summary_variants_normalized to reflect new protein formatting for Trp24Ter.
  • hgvs-weaver/tests/test_regressions.rs
    • Added new tests for test_regression_gln4del_vs_ter and test_regression_pro_ile_mismatch to address specific protein projection regressions.
    • Added test_regression_parse_clinvar_repeat to ensure correct parsing of repeat variants.
  • hgvs-weaver/tests/toy_data_test.rs
    • Updated test_toy_minus_strand_mapping to use Option<&str> for reference_ac.
  • pyproject.toml
    • Updated project version to 0.3.0.
  • scripts/analyze_validation.py
    • Updated script to use weaver.VariantMapper and mapper.equivalent for more robust protein equivalence checks.
    • Modified output to report Identity and Analogous matches separately for Weaver and Ref-HGVS.
    • Added JSON output for current_stats.json to track performance history.
  • scripts/classify_failures.py
    • Integrated weaver.EquivalenceLevel and mapper.equivalent_level for advanced classification of failure categories.
    • Added check_consistency function to verify biological equivalence between nucleotide and protein variants.
    • Enhanced classification logic to distinguish between 'ClinVar Match', 'Biological Equivalence', and 'Parity Match' based on granular equivalence levels.
  • scripts/update_readme_performance.py
    • Modified script to generate separate light and dark themed SVG performance graphs.
    • Updated graph generation to use stacked bar charts for 'Identity %' and 'Analogous %' for Weaver and Ref-HGVS.
    • Updated README injection to use <picture> tag for theme-aware SVG display.
  • src/lib.rs
    • Exposed EquivalenceLevel enum to Python as PyEquivalenceLevel.
    • Added equivalent_level method to PyVariantMapper for Python API.
    • Made PyIdentifierType and PyEquivalenceLevel hashable in Python.
  • tests/test_hgvs_eval_integration.py
    • Updated test case for NM_005813.3:c.2673insA to NM_005813.3:c.2673insG for consistency.
  • weaver/__init__.py
    • Exported EquivalenceLevel from the _weaver module.
  • weaver/_weaver.pyi
    • Updated stub file to include EquivalenceLevel enum and its methods.
    • Adjusted type hints for Variant.gene and VariantMapper methods.
    • Added equivalent_level method to VariantMapper in the stub file.
  • weaver/cli/provider.py
    • Implemented _translate_cds method for accurate protein sequence translation from CDS.
    • Modified get_seq to correctly retrieve protein sequences by translating the CDS region of a transcript.
  • weaver/cli/validate.py
    • Added equivalence_level field to the output CSV.
    • Integrated _rs_mapper.equivalent_level to determine and report the equivalence level for each variant.
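
The `decompose_aa` entry in the changelog above describes parsing protein sequences written in either 1-letter or 3-letter codes. A minimal Python sketch of that idea follows. It is not the Rust implementation: the strictness rules (3-letter codes must be capitalized like `Ala`, 1-letter codes must be uppercase) and the residue table are assumptions made for this example.

```python
# 3-letter to 1-letter amino acid codes, including Sec and Ter.
AA3_TO_AA1 = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Sec": "U", "Ter": "*",
}
AA1 = set(AA3_TO_AA1.values())


def decompose_aa(seq):
    """Split a protein string into 1-letter residues (illustrative only)."""
    # Try strict 3-letter parsing first: every chunk must be a known code.
    if seq and len(seq) % 3 == 0:
        chunks = [seq[i:i + 3] for i in range(0, len(seq), 3)]
        if all(chunk in AA3_TO_AA1 for chunk in chunks):
            return [AA3_TO_AA1[chunk] for chunk in chunks]
    # Fall back to strict 1-letter parsing.
    if seq and all(ch in AA1 for ch in seq):
        return list(seq)
    raise ValueError(f"not a valid 1-letter or 3-letter sequence: {seq!r}")
```

Trying the 3-letter form first keeps inputs such as `MetAlaTer` from being misread, while an uppercase string like `ALA` still falls through to 1-letter parsing because the table lookup is case-sensitive.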

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant improvements to protein variant projection and a new granular EquivalenceLevel enum, which lets the library distinguish variants with identical notation from variants that are biologically equivalent but written differently (e.g., insertions vs. duplications). The move to external themed SVGs in the README and the addition of a troubleshooting section for stub_gen are also welcome improvements. However, there is a regression in how variants affecting the stop codon are normalized, and an efficiency bottleneck in the Python DataProvider's protein sequence retrieval that should be addressed to maintain the library's high-performance goals.

- Normalize p.fsTer1 to standard nonsense (e.g., p.Arg83Ter) or synonymous (e.g., p.Ter1=) descriptions as per HGVS recommendations.
- Refactored the reporting layer in `altseq_to_hgvsp.rs` to handle this edge case while preserving mechanistic metadata in the core.
- Added regression tests in `analogous_test.rs`.
- Finalized v0.3.0 performance visualizations and pre-commit fixes.
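
The `p.fsTer1` normalization listed above can be sketched at the reporting level roughly as follows. This is a simplified Python illustration, not the code in `altseq_to_hgvsp.rs`: the function name and its parameters are hypothetical, and it only covers the case where the new stop codon is the first altered residue.

```python
def describe_frameshift(ref_aa, pos, alt_aa, ter_offset):
    """Format a frameshift outcome as a protein-level description.

    ref_aa / alt_aa are 3-letter codes, pos is the 1-based protein position,
    and ter_offset is the distance to the new stop codon (1 means the first
    altered residue is already a stop). All names are hypothetical.
    """
    if ter_offset == 1:
        if ref_aa == "Ter":
            # The stop codon is unchanged at the protein level.
            return f"p.Ter{pos}="
        # An immediate stop is reported as a nonsense variant.
        return f"p.{ref_aa}{pos}Ter"
    return f"p.{ref_aa}{pos}{alt_aa}fsTer{ter_offset}"
```

Keeping this decision in a formatting function means the underlying frameshift record can stay intact, which matches the commit's goal of preserving the mechanistic metadata.
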
Avoids functools.lru_cache on method to eliminate potential memory leak warnings and provide explicit cache management.
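
For context on the caching change above: `functools.lru_cache` applied to a method stores `self` in a cache that lives on the class-level function, which can keep instances alive longer than expected. A common alternative is a per-instance dictionary with an explicit clear method. The sketch below shows that pattern with a hypothetical `CachingProvider` class and `fetch` callable; it is not the project's DataProvider.

```python
class CachingProvider:
    """Toy provider with an explicit per-instance cache (hypothetical)."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable: accession -> sequence
        self._protein_cache = {}   # owned by the instance, freed with it

    def get_protein(self, accession):
        try:
            return self._protein_cache[accession]
        except KeyError:
            seq = self._fetch(accession)
            self._protein_cache[accession] = seq
            return seq

    def clear_cache(self):
        """Drop all cached sequences, e.g. after switching data sources."""
        self._protein_cache.clear()


calls = []


def fake_fetch(accession):
    # Stand-in for an expensive lookup; records each real fetch.
    calls.append(accession)
    return "MA*"


provider = CachingProvider(fake_fetch)
first = provider.get_protein("ACC1")
second = provider.get_protein("ACC1")  # served from the cache
```
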
@folded folded merged commit 93058ee into main Feb 22, 2026
4 checks passed
@folded folded deleted the refactor/equivalence-level-enum branch February 22, 2026 04:44