feat(dv): records matching and reporting PTransforms#3440

Open
manitgupta wants to merge 2 commits into main from dv-6b-processors

@manitgupta
Member

@manitgupta manitgupta commented Mar 5, 2026

TL;DR

Added two new composite PTransform classes for the GCS-Spanner data validation pipeline: MatchRecordsTransform for comparing records from source and Spanner to identify mismatches, and ReportResultsTransform for generating table-level validation statistics and surfacing mismatched records.

What changed?

  • MatchRecordsTransform: New PTransform that takes collections of ComparisonRecord objects from both Source and Spanner, applies a CoGroupByKey to pair them up by primary key, and uses FunnelComparedRecordsFn to sort each record into one of three categories: matched, missing in Spanner, or missing in the source.
  • ReportResultsTransform: New PTransform that consumes the categorized records from the matching phase. It delegates to ComputeTableStatsFn to aggregate table-level statistics (e.g., total matched, total missing) and prepares the mismatched records for final output.
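
As a rough illustration of the matching step, here is a minimal plain-Java sketch of the three-way categorization. It is not the PR's Beam code: it stands in for what a CoGroupByKey followed by a categorizing DoFn would produce, using ordinary sets instead of PCollections. The class name RecordMatchSketch, the Result holder, and the match method are hypothetical, and what goes into the join key (primary key, record hash, or both) is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java stand-in for the join-and-categorize step; not the PR's Beam code. */
final class RecordMatchSketch {

  /** Hypothetical result holder: join keys grouped by outcome. */
  static final class Result {
    final List<String> matched = new ArrayList<>();
    final List<String> missingInSpanner = new ArrayList<>();
    final List<String> missingInSource = new ArrayList<>();
  }

  /**
   * Categorizes every join key seen on either side.
   * Assumption: each key appears at most once per side.
   */
  static Result match(Set<String> sourceKeys, Set<String> spannerKeys) {
    Result result = new Result();
    // Union of both key sets, sorted so the output order is deterministic.
    Set<String> allKeys = new TreeSet<>(sourceKeys);
    allKeys.addAll(spannerKeys);
    for (String key : allKeys) {
      boolean inSource = sourceKeys.contains(key);
      boolean inSpanner = spannerKeys.contains(key);
      if (inSource && inSpanner) {
        result.matched.add(key);
      } else if (inSource) {
        // Present in the source only.
        result.missingInSpanner.add(key);
      } else {
        // Present in Spanner only.
        result.missingInSource.add(key);
      }
    }
    return result;
  }
}
```

Calling RecordMatchSketch.match with source keys {k1, k2} and Spanner keys {k2, k3} puts k2 in matched, k1 in missingInSpanner, and k3 in missingInSource. In a Beam pipeline the same branching would run once per key on the grouped values, with one output tag per category.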

Added unit tests for both PTransform classes to ensure the matching logic and aggregation pipelines are wired correctly.

How to test?

Run the new unit tests:

  • MatchRecordsTransformTest - Tests the CoGroupByKey integration, ensuring mock Source and Spanner records are correctly joined and partitioned into the expected output tags.
  • ReportResultsTransformTest - Tests the aggregation pipeline, verifying that input matched/mismatched records correctly produce aggregated table statistics.
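
The aggregation that ReportResultsTransformTest exercises can be pictured as per-table counting of categorized records. The sketch below shows that idea in plain Java, without Beam. TableStatsSketch, Outcome, CategorizedRecord, and countByTable are hypothetical names chosen for illustration and are not taken from the PR.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java stand-in for table-level statistics; not the PR's Beam code. */
final class TableStatsSketch {

  /** The three categories named in the PR description. */
  enum Outcome { MATCHED, MISSING_IN_SPANNER, MISSING_IN_SOURCE }

  /** Hypothetical record shape: the table a record belongs to and its outcome. */
  static final class CategorizedRecord {
    final String table;
    final Outcome outcome;

    CategorizedRecord(String table, Outcome outcome) {
      this.table = table;
      this.outcome = outcome;
    }
  }

  /** Counts records per table and per outcome; outcomes never seen for a table are absent. */
  static Map<String, Map<Outcome, Long>> countByTable(List<CategorizedRecord> records) {
    Map<String, Map<Outcome, Long>> stats = new TreeMap<>();
    for (CategorizedRecord record : records) {
      stats
          .computeIfAbsent(record.table, t -> new EnumMap<Outcome, Long>(Outcome.class))
          .merge(record.outcome, 1L, Long::sum);
    }
    return stats;
  }
}
```

A test in this style feeds a small fixed list of categorized records and asserts on the resulting counts, which is the same shape of check the description gives for the aggregation pipeline.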

Why make this change?

These PTransform classes encapsulate the core business logic of the validation tool. By separating the matching (MatchRecordsTransform) from the aggregation and reporting (ReportResultsTransform), the pipeline maintains clear boundaries. This modularity makes it easier to independently scale the memory-intensive matching phase and the subsequent reporting phase. This PR builds upon the ingestion transforms introduced in dv-6a-readers.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates essential data validation capabilities by introducing two new Apache Beam PTransforms. The first transform efficiently matches records between a source and a destination, identifying discrepancies. The second transform then systematically reports these validation outcomes, including detailed mismatches and aggregated statistics, to BigQuery for further analysis and monitoring.

Highlights

  • New PTransform for Record Matching: Introduced MatchRecordsTransform which compares records from a source (GCS) and a destination (Spanner) based on their hash, categorizing them into matched, missing in source, or missing in Spanner.
  • New PTransform for Results Reporting: Added ReportResultsTransform to handle the output of the record matching process. This transform writes detailed mismatched records, table-level validation statistics, and an overall validation summary to BigQuery.
  • Comprehensive Unit Tests: Included dedicated unit tests for both MatchRecordsTransform and ReportResultsTransform to ensure the correctness of record comparison logic and the calculation of validation metrics.

Changelog

  • v2/gcs-spanner-dv/src/main/java/com/google/cloud/teleport/v2/transforms/MatchRecordsTransform.java
    • Added a new PTransform for comparing records from source and Spanner, identifying matched, missing-in-source, and missing-in-Spanner records.
  • v2/gcs-spanner-dv/src/main/java/com/google/cloud/teleport/v2/transforms/ReportResultsTransform.java
    • Added a new PTransform responsible for reporting data validation results to BigQuery, including mismatched records, table statistics, and a validation summary.
    • Implemented logic to explicitly register SchemaCoders for ValidationSummary to prevent runtime errors in Beam.
  • v2/gcs-spanner-dv/src/test/java/com/google/cloud/teleport/v2/transforms/MatchRecordsTransformTest.java
    • Added unit tests for MatchRecordsTransform covering various record matching scenarios, including full matches, missing in Spanner, missing in source, and mixed cases.
  • v2/gcs-spanner-dv/src/test/java/com/google/cloud/teleport/v2/transforms/ReportResultsTransformTest.java
    • Added unit tests for ReportResultsTransform's record transformation and statistics calculation logic, specifically for transformMismatchedRecords, calculateTableStats, and calculateValidationSummary.

Activity

  • No human activity has been recorded on this pull request yet.

@manitgupta manitgupta added the new-template and improvement (Making existing code better) labels Mar 5, 2026
@manitgupta manitgupta marked this pull request as ready for review March 5, 2026 05:07
@manitgupta manitgupta requested a review from a team as a code owner March 5, 2026 05:07
@manitgupta manitgupta requested review from darshan-sj and rohitwali and removed request for a team March 5, 2026 05:07
Contributor

@bharadwaj-aditya bharadwaj-aditya left a comment

Seems fine overall. Couple of questions.

If they don't seem major, please go ahead and merge.

Base automatically changed from dv-6a-readers to main March 6, 2026 10:24
@manitgupta manitgupta dismissed bharadwaj-aditya’s stale review March 6, 2026 10:24

The base branch was changed.

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.90%. Comparing base (ecf2ee8) to head (3da3446).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3440      +/-   ##
============================================
+ Coverage     51.93%   57.90%   +5.96%     
+ Complexity     5457     1887    -3570     
============================================
  Files          1018      485     -533     
  Lines         61845    27829   -34016     
  Branches       6813     3011    -3802     
============================================
- Hits          32120    16113   -16007     
+ Misses        27498    10756   -16742     
+ Partials       2227      960    -1267

Components                         Coverage Δ
spanner-templates                  72.75% <ø> (+1.01%) ⬆️
spanner-import-export              ∅ <ø> (∅)
spanner-live-forward-migration     80.37% <ø> (-0.02%) ⬇️
spanner-live-reverse-replication   77.78% <ø> (-0.06%) ⬇️
spanner-bulk-migration             88.45% <ø> (-0.02%) ⬇️
gcs-spanner-dv                     86.24% <ø> (-0.78%) ⬇️

see 556 files with indirect coverage changes
Labels

improvement (Making existing code better), new-template, size/XL
