Implement pluggable Lineage in Java SDK #36781

Open
shnapz wants to merge 16 commits into apache:master from shnapz:pluggable-lineage

@shnapz shnapz commented Nov 10, 2025

Addresses #36790: "[Feature Request]: Make lineage tracking pluggable"

Changes

  • Created org.apache.beam.sdk.lineage.LineageBase - Plugin interface with single add() method
  • Refactored org.apache.beam.sdk.metrics.Lineage - Hardcoded metrics → Delegation to LineageBase plugins
  • Created org.apache.beam.sdk.lineage.LineageRegistrar - Plugin discovery interface (ServiceLoader)
  • Extracted org.apache.beam.sdk.metrics.MetricsLineage - Default metrics-based implementation (implements LineageBase)
  • Added Plugin initialization in FileSystems.setDefaultPipelineOptions()

Architecture

Before (master): Lineage was a concrete class hardcoded to use Beam metrics:

public class Lineage {
  private static final Lineage SOURCES = new Lineage(Type.SOURCE);
  private static final Lineage SINKS = new Lineage(Type.SINK);
  private final Metric metric;  // Hardcoded to Beam metrics

  private Lineage(Type type) {
    this.metric = Metrics.stringSet(LINEAGE_NAMESPACE, type.toString());
  }

  public void add(Iterable<String> segments) {
    ((StringSet) metric).add(String.join("", segments));  // Always metrics
  }
}

After (this PR): Clean separation via composition pattern:

// Plugin contract (simple interface)
public interface LineageBase {
  void add(Iterable<String> rollupSegments);
}

// Public API (final facade delegating to plugin)
public final class Lineage {
  private final LineageBase delegate;  // Plugin implementation

  private Lineage(LineageBase delegate) {
    this.delegate = delegate;
  }

  // Delegates to plugin
  public void add(Iterable<String> segments) {
    delegate.add(segments);
  }

  // Convenience overloads
  public void add(String system, Iterable<String> segments) { ... }
  public void add(String system, String subtype, ...) { ... }

  // Static utilities (unchanged)
  public static Lineage getSources() { ... }
  public static Lineage getSinks() { ... }
  public static String wrapSegment(String value) { ... }
  public static Set<String> query(MetricResults results, Type type) { ... }
}

// Default implementation (backward compatible)
public class MetricsLineage implements LineageBase {
  private final Metric metric;

  @Override
  public void add(Iterable<String> segments) {
    ((BoundedTrie) metric).add(segments);
  }
}

Plugin Selection: ServiceLoader discovery, first match wins, fallback to MetricsLineage.
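
A minimal sketch of the "first match wins, fallback to default" selection described above. It deliberately avoids any Beam dependency: SketchLineageBase, SketchLineageRegistrar, and the String config parameter are hypothetical stand-ins for LineageBase, LineageRegistrar, and PipelineOptions, and the actual selection code in this PR may differ.

```java
import java.util.ServiceLoader;

// Hypothetical stand-in for the plugin contract (LineageBase in this PR).
interface SketchLineageBase {
  void add(Iterable<String> rollupSegments);
}

// Hypothetical stand-in for the discovery interface (LineageRegistrar in this PR).
interface SketchLineageRegistrar {
  // Returns null when this plugin is not enabled for the given configuration.
  SketchLineageBase fromConfig(String config);
}

final class PluginSelection {
  private PluginSelection() {}

  // The first registrar that returns a non-null plugin wins; otherwise the fallback is used.
  static SketchLineageBase select(
      Iterable<? extends SketchLineageRegistrar> registrars,
      String config,
      SketchLineageBase fallback) {
    for (SketchLineageRegistrar registrar : registrars) {
      SketchLineageBase plugin = registrar.fromConfig(config);
      if (plugin != null) {
        return plugin;
      }
    }
    return fallback;
  }

  // ServiceLoader is an Iterable, so classpath discovery can reuse select() directly.
  static SketchLineageBase selectFromClasspath(String config, SketchLineageBase fallback) {
    return select(ServiceLoader.load(SketchLineageRegistrar.class), config, fallback);
  }
}
```

Taking an Iterable rather than calling ServiceLoader inside select() keeps the selection rule testable without META-INF/services entries. Note that with ServiceLoader the iteration order depends on the classpath, so "first match" is only deterministic when at most one registrar is enabled.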

Backward Compatibility: ✅ All existing code works unchanged (24+ call sites, static utilities, enums).

Why Pluggable Lineage?

1. Runner Fragmentation

Metrics-based lineage is scattered across runners with inconsistent support:

  • Dataflow: Real-time metrics export to Cloud Monitoring
  • Flink: Batch-only aggregation (no streaming support yet)
  • Spark/Direct: Varying levels of support

Impact: Multi-runner organizations must consolidate lineage from different metrics backends, each with different APIs and formats.

Plugin Solution: Single implementation works consistently across all runners.

2. Enterprise Integration

Organizations with existing lineage infrastructure need:

  • Direct API integration (Atlan, Collibra, Marquez, DataHub, OpenLineage)
  • Custom metadata enrichment not in metrics subsystem

Example: Flyte workflow executing a Beam pipeline needs to tag lineage with Flyte execution ID and cost allocation. This context exists in the orchestrator, not in Beam workers' metrics.
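
To illustrate the kind of enrichment meant here, the following sketch tags each lineage entry with an orchestrator execution ID. Everything in it is an assumption for illustration: LineageSink stands in for LineageBase, the "executionId|fqn" record format is made up, and records are kept in memory rather than sent to a real lineage backend.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the plugin contract (LineageBase in this PR).
interface LineageSink {
  void add(Iterable<String> rollupSegments);
}

// Attaches orchestrator context (for example, a Flyte execution ID) to every lineage entry.
final class ContextTaggingLineage implements LineageSink {
  private final String executionId;
  private final List<String> emitted = Collections.synchronizedList(new ArrayList<>());

  ContextTaggingLineage(String executionId) {
    this.executionId = executionId;
  }

  @Override
  public void add(Iterable<String> rollupSegments) {
    // Segments are joined without a delimiter, as in the examples in this PR.
    String fqn = String.join("", rollupSegments);
    // Illustrative record format only; a real plugin would call its lineage backend here.
    emitted.add(executionId + "|" + fqn);
  }

  // Returns a snapshot of the records emitted so far.
  List<String> emitted() {
    synchronized (emitted) {
      return new ArrayList<>(emitted);
    }
  }
}
```

A plugin of this shape would receive the execution ID from pipeline options set by the orchestrator, which is context the metrics subsystem on the workers does not carry.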

3. Standard Formats

OpenLineage is a widely adopted open standard for lineage metadata. A plugin enables direct emission instead of the multi-step path: export metrics → parse → transform → send.

Initialization

Lineage.setDefaultPipelineOptions(options) is called from FileSystems.setDefaultPipelineOptions() (same pattern as Metrics).

Rationale: FileSystems.setDefaultPipelineOptions() is called at 48+ locations covering all execution scenarios (pipeline construction, worker startup, deserialization).

Known Limitation: This follows the existing FileSystems pattern despite its known issues (#18430). Any architectural improvement would need to address all subsystems together.

Thread Safety

Uses AtomicReference with compareAndSet loop (same pattern as FileSystems/Metrics):

  • AtomicReference<KV<Long, Integer>> tracks PipelineOptions identity
  • AtomicReference<Lineage> for SOURCES/SINKS instances
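
The compareAndSet loop mentioned above can be sketched roughly as follows. This is an approximation written from the description, not the code in this PR: the Revision class is a hypothetical stand-in for KV<Long, Integer>, and the method name tryAdvance is an assumption.

```java
import java.util.concurrent.atomic.AtomicReference;

final class OptionsRevisionGuard {
  // Immutable (optionsId, revision) pair; hypothetical stand-in for KV<Long, Integer>.
  static final class Revision {
    final long optionsId;
    final int revision;

    Revision(long optionsId, int revision) {
      this.optionsId = optionsId;
      this.revision = revision;
    }
  }

  private final AtomicReference<Revision> current = new AtomicReference<>(null);

  // Returns true if the caller should (re)initialize plugins for these options,
  // false if the same or a newer revision of the same options was already applied.
  boolean tryAdvance(long optionsId, int revision) {
    Revision next = new Revision(optionsId, revision);
    while (true) {
      Revision seen = current.get();
      if (seen != null && seen.optionsId == optionsId && seen.revision >= revision) {
        return false;
      }
      if (current.compareAndSet(seen, next)) {
        return true;
      }
      // Another thread updated the reference first; re-read and retry.
    }
  }
}
```

Only the thread whose compareAndSet succeeds proceeds to initialization, so concurrent calls with the same options do not initialize twice.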

Example: OpenLineage Plugin

For demonstration only (OpenLineage integration is out of scope for this PR).

// 1. Plugin options
public interface OpenLineageOptions extends PipelineOptions {
  @Description("OpenLineage endpoint URL")
  String getOpenLineageUrl();
  void setOpenLineageUrl(String url);

  @Description("Enable OpenLineage plugin")
  @Default.Boolean(false)
  Boolean getEnableOpenLineage();
  void setEnableOpenLineage(Boolean enable);
}

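// 2. Implement LineageBase
class OpenLineageReporter implements LineageBase {
  private final String endpoint;
  private final Lineage.LineageDirection direction;

  // Constructor matching the call in OpenLineageRegistrar.fromOptions() below
  OpenLineageReporter(String endpoint, Lineage.LineageDirection direction) {
    this.endpoint = endpoint;
    this.direction = direction;
  }

  @Override
  public void add(Iterable<String> rollupSegments) {
    String fqn = String.join("", rollupSegments);
    // POST to OpenLineage API with workflow context
    // (sendToOpenLineage is a placeholder for the HTTP call, not shown here)
    sendToOpenLineage(endpoint, direction, fqn);
  }
}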

// 3. Register via ServiceLoader
@AutoService(LineageRegistrar.class)
public class OpenLineageRegistrar implements LineageRegistrar {
  @Override
  public LineageBase fromOptions(PipelineOptions options, Lineage.LineageDirection direction) {
    OpenLineageOptions opts = options.as(OpenLineageOptions.class);
    if (opts.getEnableOpenLineage()) {
      return new OpenLineageReporter(opts.getOpenLineageUrl(), direction);
    }
    return null; // Fall back to MetricsLineage
  }
}

// 4. Usage
PipelineOptions options = PipelineOptionsFactory.create();
options.as(OpenLineageOptions.class).setEnableOpenLineage(true);
options.as(OpenLineageOptions.class).setOpenLineageUrl("https://lineage-api.example.com");
Pipeline p = Pipeline.create(options);

@github-actions github-actions bot added the java label Nov 10, 2025
@shnapz shnapz changed the title Add pluggable LineageReporter interface Make Lineage pluggable Nov 11, 2025
@shnapz shnapz changed the title Make Lineage pluggable Implement pluggable Lineage in Java SDK Nov 11, 2025
? (JmsTextMessage message) -> {
if (message == null) {
return null;
}
Author

Unrelated change to fix flaky tests.

  assertTrue(
  String.format("Too many unacknowledged messages: %d", unackRecords),
- unackRecords < OPTIONS.getNumberOfRecords() * 0.003);
+ unackRecords < OPTIONS.getNumberOfRecords() * 0.005);
Author

Unrelated change to fix flaky tests.

@shnapz shnapz marked this pull request as ready for review December 31, 2025 17:51
@github-actions
Contributor

Assigning reviewers:

R: @kennknowles for label java.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@codecov

codecov bot commented Dec 31, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.86%. Comparing base (298f309) to head (7bcab65).
⚠️ Report is 12 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #36781      +/-   ##
============================================
- Coverage     56.87%   56.86%   -0.01%     
+ Complexity     3417     3414       -3     
============================================
  Files          1178     1178              
  Lines        187492   187495       +3     
  Branches       3590     3590              
============================================
- Hits         106628   106620       -8     
- Misses        77472    77480       +8     
- Partials       3392     3395       +3     
Flag Coverage Δ
java 71.92% <ø> (-0.04%) ⬇️

@github-actions
Contributor

github-actions bot commented Jan 8, 2026

Reminder, please take a look at this pr: @kennknowles

@shnapz
Author

shnapz commented Jan 12, 2026

[PreCommit Java / beam_PreCommit_Java](https://github.com/apache/beam/actions/runs/20927101528/job/60127906908?pr=36781) has an apparently unrelated gRPC client failure. It was green before I rebased on the latest master.

@github-actions
Contributor

Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn for label java.

@shnapz
Author

shnapz commented Jan 20, 2026

@Abacn or @kennknowles, would you mind taking a look? The change is low-risk and adds important flexibility.

@github-actions
Contributor

Reminder, please take a look at this pr: @Abacn

@github-actions
Contributor

Assigning new set of reviewers because PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.

@github-actions
Contributor

github-actions bot commented Feb 6, 2026

Reminder, please take a look at this pr: @kennknowles

@kennknowles
Member

waiting on author

// Reserved characters are backtick, colon, whitespace (space, \t, \n) and dot.
private static final Pattern RESERVED_CHARS = Pattern.compile("[:\\s.`]");

private final LineageBase delegate;
Author

FYI: we use Lineage as a facade around LineageBase, so the latter is not exposed and we avoid any (breaking) changes in other parts of Beam.

@shnapz shnapz force-pushed the pluggable-lineage branch from e116899 to 7bcab65 March 6, 2026 17:25
@shnapz
Author

shnapz commented Mar 6, 2026

@kennknowles Thanks for the feedback! I've updated the PR description with a detailed design section addressing your questions.

Selection Model: Single active plugin (first match wins via ServiceLoader), fallback to MetricsLineage. This matches Beam's existing patterns (similar to PipelineOptions registrars).

I've added a "Why Pluggable Lineage?" section to the PR description with three key use cases:

  • Runner Fragmentation. Multi-runner orgs need unified lineage.
  • Enterprise Integration. Direct integration with existing lineage systems + workflow-level context unavailable to worker metrics.
  • Standard Formats. Direct OpenLineage emission vs. export -> parse -> transform -> send.

Let me know if you have any questions!

@kennknowles
Member

R: @rohitsinha54

@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control. If you'd like to restart, comment assign set of reviewers
