Implement pluggable Lineage in Java SDK #36781
? (JmsTextMessage message) -> {
    if (message == null) {
      return null;
    }
Unrelated change to fix flaky tests.
  assertTrue(
      String.format("Too many unacknowledged messages: %d", unackRecords),
-     unackRecords < OPTIONS.getNumberOfRecords() * 0.003);
+     unackRecords < OPTIONS.getNumberOfRecords() * 0.005);
Unrelated change to fix flaky tests.
Force-pushed from e8b6a7e to 661f5c7.
Force-pushed from 23b45a2 to 4065277.
Assigning reviewers: R: @kennknowles for label java. Note: If you would like to opt out of this review, comment Available commands:
The PR bot will only process comments in the main thread (not review comments).
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #36781 +/- ##
============================================
- Coverage 56.87% 56.86% -0.01%
+ Complexity 3417 3414 -3
============================================
Files 1178 1178
Lines 187492 187495 +3
Branches 3590 3590
============================================
- Hits 106628 106620 -8
- Misses 77472 77480 +8
- Partials 3392 3395 +3
Reminder, please take a look at this PR: @kennknowles
Force-pushed from 032292b to 6d7a434.
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment R: @Abacn for label java. Available commands:
@Abacn or @kennknowles would you mind taking a look? The change is low risk and introduces some important flexibility.
Reminder, please take a look at this PR: @Abacn
Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
Reminder, please take a look at this PR: @kennknowles
waiting on author
Force-pushed from 6d7a434 to f61e592.
// Reserved characters are backtick, colon, whitespace (space, \t, \n) and dot.
private static final Pattern RESERVED_CHARS = Pattern.compile("[:\\s.`]");

private final LineageBase delegate;
FYI: we are using Lineage as a facade around LineageBase, so we don't expose the latter and we avoid any (breaking) changes in other parts of Beam.
Force-pushed from e116899 to 7bcab65.
@kennknowles Thanks for the feedback! I've updated the PR description with a detailed design section addressing your questions.
Selection Model: Single active plugin (first match wins via ServiceLoader), with fallback to the default metrics-based implementation.
I've added a "Why Pluggable Lineage?" section to the PR description with four key use cases.
Let me know if you have any questions!
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment
Addresses #36790: "[Feature Request]: Make lineage tracking pluggable"
Changes
- org.apache.beam.sdk.lineage.LineageBase - Plugin interface with single add() method
- org.apache.beam.sdk.metrics.Lineage - Hardcoded metrics → Delegation to LineageBase plugins
- org.apache.beam.sdk.lineage.LineageRegistrar - Plugin discovery interface (ServiceLoader)
- org.apache.beam.sdk.metrics.MetricsLineage - Default metrics-based implementation (implements LineageBase)
- FileSystems.setDefaultPipelineOptions()
Architecture
Before (master): Lineage was a concrete class hardcoded to use Beam metrics.
After (this PR): Clean separation via composition pattern.
Plugin Selection: ServiceLoader discovery, first match wins, fallback to MetricsLineage.
Backward Compatibility: ✅ All existing code works unchanged (24+ call sites, static utilities, enums).
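To make the facade and the selection rule concrete, here is a minimal, self-contained sketch. It does not use the real Beam classes: the types below are simplified stand-ins that borrow the names from this PR, and the add() parameters, the fromOptions() method, and the String options placeholder are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ServiceLoader;

/** Simplified stand-ins for the types named in this PR; all signatures are assumptions. */
final class PluggableLineageSketch {

  /** Plugin interface with a single add() method (parameter types are hypothetical). */
  interface LineageBase {
    void add(String system, Iterable<String> segments);
  }

  /** Discovery interface; a real plugin would be listed under META-INF/services. */
  interface LineageRegistrar {
    /** Returns a plugin, or null if this registrar does not apply (hypothetical contract). */
    LineageBase fromOptions(String optionsPlaceholder);
  }

  /** Stand-in for the default metrics-based implementation; records to a list instead. */
  static final class MetricsLineage implements LineageBase {
    final List<String> recorded = new ArrayList<>();

    @Override
    public void add(String system, Iterable<String> segments) {
      recorded.add(system + ":" + String.join(".", segments));
    }
  }

  /** Facade: callers keep using Lineage while the delegate stays hidden. */
  static final class Lineage {
    private final LineageBase delegate;

    private Lineage(LineageBase delegate) {
      this.delegate = delegate;
    }

    /** First registrar that returns a non-null plugin wins; otherwise fall back. */
    static Lineage create(Iterable<LineageRegistrar> registrars, String options) {
      for (LineageRegistrar registrar : registrars) {
        LineageBase plugin = registrar.fromOptions(options);
        if (plugin != null) {
          return new Lineage(plugin);
        }
      }
      return new Lineage(new MetricsLineage());
    }

    /** Discovers registrars on the classpath via ServiceLoader. */
    static Lineage create(String options) {
      return create(ServiceLoader.load(LineageRegistrar.class), options);
    }

    void add(String system, Iterable<String> segments) {
      delegate.add(system, segments);
    }

    LineageBase delegateForTesting() {
      return delegate;
    }
  }

  public static void main(String[] args) {
    // With no registrar on the classpath, the facade falls back to MetricsLineage.
    Lineage lineage = Lineage.create("opts");
    lineage.add("bigquery", Arrays.asList("project", "dataset", "table"));
    System.out.println(lineage.delegateForTesting().getClass().getSimpleName());
  }
}
```

Because existing call sites only see the facade, swapping the delegate does not change their code, which is the backward-compatibility argument made above.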
Why Pluggable Lineage?
1. Runner Fragmentation
Metrics-based lineage is scattered across runners with inconsistent support:
Impact: Multi-runner organizations must consolidate lineage from different metrics backends, each with different APIs and formats.
Plugin Solution: Single implementation works consistently across all runners.
2. Enterprise Integration
Organizations with existing lineage infrastructure need:
Example: Flyte workflow executing a Beam pipeline needs to tag lineage with Flyte execution ID and cost allocation. This context exists in the orchestrator, not in Beam workers' metrics.
3. Standard Formats
OpenLineage is the industry standard. A plugin enables direct emission, instead of the current chain of export metrics → parse → transform → send.
Initialization
Lineage.setDefaultPipelineOptions(options) is called from FileSystems.setDefaultPipelineOptions() (same pattern as Metrics).
Rationale: FileSystems.setDefaultPipelineOptions() is called at 48+ locations covering all execution scenarios (pipeline construction, worker startup, deserialization).
Known Limitation: Follows existing FileSystems pattern despite known issues (#18430). Architectural improvements would address all subsystems together.
Thread Safety
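As a rough illustration of how an options-identity guard in this section could work, the sketch below runs a compareAndSet loop over an (options id, revision) pair. It is not the PR's code: Map.Entry stands in for Beam's KV, the method name and return value are hypothetical, and the rule that a different options id always triggers re-initialization is an assumption.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class LineageInitSketch {
  // Stand-in for AtomicReference<KV<Long, Integer>>: (options id, revision).
  private static final AtomicReference<Map.Entry<Long, Integer>> REVISION =
      new AtomicReference<>();

  // Counts how often initialization actually ran (for demonstration only).
  static final AtomicInteger INIT_COUNT = new AtomicInteger();

  /**
   * Returns true if this call (re)initialized, false if the same options with an
   * equal or newer revision had already been applied.
   */
  static boolean setDefaultOptions(long optionsId, int revision) {
    while (true) {
      Map.Entry<Long, Integer> current = REVISION.get();
      if (current != null
          && current.getKey() == optionsId
          && current.getValue() >= revision) {
        return false; // already initialized for these options
      }
      if (REVISION.compareAndSet(current, new SimpleImmutableEntry<>(optionsId, revision))) {
        // A real implementation would rebuild the SOURCES/SINKS instances here.
        INIT_COUNT.incrementAndGet();
        return true;
      }
      // Another thread changed REVISION between get() and compareAndSet(); retry.
    }
  }

  public static void main(String[] args) {
    System.out.println(setDefaultOptions(1L, 0)); // first call initializes
    System.out.println(setDefaultOptions(1L, 0)); // repeated call is skipped
    System.out.println(setDefaultOptions(1L, 1)); // newer revision re-initializes
  }
}
```

The loop never blocks: a thread that loses the race re-reads the current pair and either returns early or tries again, so concurrent callers with the same options initialize at most once.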
Uses AtomicReference with compareAndSet loop (same pattern as FileSystems/Metrics):
- AtomicReference<KV<Long, Integer>> tracks PipelineOptions identity
- AtomicReference<Lineage> for SOURCES/SINKS instances
Example: OpenLineage Plugin
For demonstration only (OpenLineage integration out of scope)
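The following is a hypothetical sketch of what such a plugin might look like, not the PR's example and not a use of the real OpenLineage client. The LineageBase interface is a local stand-in with an assumed add() signature, DatasetEmittingLineage and the runId tag are invented for illustration, and the output is a hand-built JSON string that borrows only OpenLineage's namespace/name convention for datasets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OpenLineageStylePluginSketch {

  /** Local stand-in for the plugin interface; the signature is an assumption. */
  interface LineageBase {
    void add(String system, Iterable<String> segments);
  }

  /**
   * Hypothetical plugin that turns each lineage entry into a dataset record
   * tagged with an orchestrator-supplied run id (for example, an execution id).
   */
  static final class DatasetEmittingLineage implements LineageBase {
    private final String runId;
    private final List<String> emitted = new ArrayList<>();

    DatasetEmittingLineage(String runId) {
      this.runId = runId;
    }

    @Override
    public synchronized void add(String system, Iterable<String> segments) {
      String name = String.join(".", segments);
      // No JSON escaping is done here; a real plugin would use a JSON library
      // and send the event to a lineage backend instead of buffering it.
      emitted.add(
          String.format(
              "{\"runId\":\"%s\",\"namespace\":\"%s\",\"name\":\"%s\"}", runId, system, name));
    }

    /** Returns a copy of the records buffered so far. */
    synchronized List<String> emitted() {
      return new ArrayList<>(emitted);
    }
  }

  public static void main(String[] args) {
    DatasetEmittingLineage plugin = new DatasetEmittingLineage("run-1");
    plugin.add("kafka", Arrays.asList("cluster", "topic"));
    plugin.emitted().forEach(System.out::println);
  }
}
```

The run id is passed in at construction time because, as the Enterprise Integration use case notes, that context lives in the orchestrator rather than in Beam workers' metrics.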