Compass Apple-2-Apple runner


Offline, apple-to-apple benchmark runner for FAIR Signposting (A2A) recipes using Compass.

This project is designed to answer one practical question:

Given a fixed set of “correct” and “faulty” FAIR Signposting examples, how well does the Compass validator detect the expected relations and issues?

It does that by:

  1. Harvesting HTTP response headers from the published A2A benchmark pages into offline fixtures (YAML files).
  2. Replaying those fixtures locally (no network needed for the benchmark run).
  3. Validating the discovered Web Links with Compass.
  4. Asserting the output against a curated set of expectations (relations + expected errors/warnings).
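
For orientation, a harvested fixture (the output of step 1) might look roughly like the sketch below. The exact schema is whatever the harvester writes; the field names here are illustrative, not authoritative.

# fixtures/http/<fixture-id>.yml (illustrative shape only, not the exact schema)
uri: https://example.org/a2a/scenario-01       # page the headers were harvested from
fetchedAt: 2024-01-01T00:00:00Z                # recorded fetch timestamp
headers:
  Link:
    - '<https://doi.org/10.1234/abcd>; rel="cite-as"'
    - '<https://example.org/metadata.xml>; rel="describedby"; type="application/xml"'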

What this runner is (and is not)

✅ It is

  • An integration benchmark for Compass on a known dataset (A2A FAIR Signposting scenarios).
  • Offline and reproducible once fixtures are harvested.
  • A way to detect regressions: if Compass behavior changes, the benchmark will fail and show where.

❌ It is not

  • A performance benchmark (speed/throughput). This runner measures correctness and behavior, not speed or throughput.
  • A crawler. During benchmark execution it does not fetch URLs; it only processes recorded HTTP headers.

Benchmark scope (important)

This repository currently benchmarks only the A2A HTTP scenarios, because that is Compass’ primary focus: validating declared Web Links (e.g. HTTP Link headers) in an offline manner.

Not included (by design):

  • Inline HTML signposting (e.g. <link ...> tags embedded in HTML). This is excluded for now because it would primarily benchmark the HTML-to-WebLink parsing step, not Compass itself; Compass validation happens on the WebLink model.
  • Any network-based behavior, such as:
    • dereferencing target URLs,
    • following Link Sets remotely,
    • HTTP content negotiation (conneg),
    • validating “does this URL resolve?” or checking response codes at runtime.

If you need those aspects, they belong in the consuming application (network client + parsers), not in Compass’ offline validator.


Repository layout (high level)

  • benchmarks/
    Input definitions for which A2A scenarios to harvest (scenario list YAML).

  • fixtures/
    Offline captured HTTP response headers per scenario (one YAML file per scenario).

  • src/main/java/...
    CLI utility to harvest fixtures.

  • src/test/groovy/...
    The benchmark runner implemented as tests (Spock), which replay fixtures and assert results.

  • src/test/resources/expectations.yaml
    The “ground truth”: what relations and issues are expected per fixture.


Prerequisites

  • Java 21
  • Maven 3.9+ (recommended)

Quick start: run the benchmark

From the project root:

mvn test

What happens:

  • All fixture files under fixtures/http/ are enumerated.
  • For each fixture, the recorded Link header values are parsed into Web Links.
  • Compass processes those Web Links.
  • The result is asserted against src/test/resources/expectations.yaml.

If everything matches expectations, the build is green.
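
For context on the parsing step: recorded Link header values follow RFC 8288 (Web Linking). A single header value such as the one below (URLs illustrative) yields two Web Links, with relation types cite-as and describedby, and those Web Links are what Compass validates.

Link: <https://doi.org/10.5281/zenodo.1234>; rel="cite-as", <https://example.org/metadata.ttl>; rel="describedby"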


(Optional) Regenerate / harvest fixtures

Fixtures are snapshots of HTTP response headers of the A2A benchmark pages. You only need to regenerate them if:

  • the upstream A2A pages changed,
  • you want to add new scenarios,
  • or you intentionally want to re-baseline the dataset.

1) Compile the project (so the harvester class is available)

mvn -DskipTests package

2) Run the harvester

java -cp target/classes life.qbic.compass.benchmark.A2AFixtureHarvester \
  --input benchmarks/a2a-scenario-uris.yaml \
  --out fixtures/http

To overwrite existing fixtures:

java -cp target/classes life.qbic.compass.benchmark.A2AFixtureHarvester \
  --input benchmarks/a2a-scenario-uris.yaml \
  --out fixtures/http \
  --force

Notes

  • Harvesting requires network access.
  • Benchmark execution (mvn test) does not.

How expectations work

The benchmark is “apple-to-apple” because results are compared to a fixed, scenario-specific baseline.

Each fixture is identified by its filename (without .yml/.yaml).
That ID must have a matching entry in:

  • src/test/resources/expectations.yaml

Expectations can express:

1) Relation cardinalities

Example: “there must be at least one describedby link”.

some-fixture-id: 
  relations: 
    describedby: { min: 1 } 
    cite-as: { min: 1, max: 1 }
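
Read min/max as bounds on how many links of that relation type must be discovered: { min: 1 } means at least one, and { min: 1, max: 1 } requires exactly one cite-as link.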

2) Coarse issue expectations (errors/warnings)

Example: “this scenario is faulty, so Compass must emit an error”.

some-fixture-id:
  expectErrors: true
  expectWarnings: false

3) Specific issues that must appear

Example: “an ERROR mentioning cite-as must be present”.

some-fixture-id: 
  mustContainIssues: 
    - severity: ERROR 
      messageContains: "cite-as"

Interpreting benchmark results

✅ Success (green build)

All fixtures matched their expectations:

  • relations were discovered with the expected minimum/maximum counts, and
  • Compass produced the expected issue types/messages.

This indicates Compass behavior is consistent with the current baseline.

❌ Failure (red build)

A failure means at least one scenario deviated from expectations. This is the key “benchmark signal”.

Typical failure modes and what they mean:

1) Relation cardinality mismatch

You may see an assertion like:

  • expected min=1 but was 0

Interpretation:

  • Compass (or link parsing) did not discover a relation that the baseline expects.
  • This could be a regression, a parsing change, or a fixture change.

2) Unexpected error/warning presence

You may see a message like:

  • expectErrors=false but was true

Interpretation:

  • Compass started reporting an error for a scenario that is considered “valid enough” by the baseline, or stopped reporting an error where one is expected.

This can be:

  • a Compass ruleset change,
  • a bugfix (baseline may need updating),
  • or a change in how issues are classified.
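
When the change is intended, re-baselining is usually a small, deliberate edit in src/test/resources/expectations.yaml, e.g. flipping expectErrors: false to expectErrors: true for the affected fixture ID once the newly reported error has been confirmed as correct.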

3) Missing expected issue message

You may see:

  • “Missing expected issue … messageContains=…”

Interpretation:

  • Compass did not emit the specific diagnostic the benchmark expects.
  • Often indicates a change in rule triggering or message wording.

What to do when a test fails

  1. Identify the failing fixture ID from the test output (the benchmark runs once per fixture).
  2. Open the corresponding fixture in fixtures/http/<fixture-id>.yml and inspect the recorded Link headers.
  3. Check src/test/resources/expectations.yaml for what was expected.
  4. Decide which side changed:
    • Compass changed (intended improvement/regression): update expectations (carefully).
    • Fixture changed (upstream changed): re-harvest fixtures and adjust expectations if needed.
    • Parsing changed: ensure Link parsing still matches the intended RFC behavior.
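
While iterating on a fix, it can be faster to run only the benchmark spec instead of the whole test phase. Assuming standard Maven Surefire behavior, something like the following narrows the run (the class name is a placeholder; substitute the actual spec class from src/test/groovy):

mvn -Dtest=A2ABenchmarkSpec test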

Adding a new scenario

  1. Add the scenario to benchmarks/a2a-scenario-uris.yaml (see the sketch after this list).
  2. Harvest fixtures (see above).
  3. Add a matching entry to src/test/resources/expectations.yaml.
  4. Run:
mvn test
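
The exact schema of benchmarks/a2a-scenario-uris.yaml is defined by the harvester; conceptually, each entry pairs a fixture ID with the URI to harvest, along these lines (illustrative):

scenarios:
  - id: my-new-scenario       # becomes fixtures/http/my-new-scenario.yml and the expectations key
    uri: https://example.org/a2a/my-new-scenario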

Reproducibility notes

  • Fixture harvesting is time-dependent (it records a fetch timestamp) and network-dependent (server behavior may change).
  • Once fixtures are committed/kept stable, benchmark runs are deterministic and offline.

License / attribution

This repository benchmarks Compass using the publicly available A2A FAIR Signposting scenarios.
If you redistribute fixtures or scenario definitions, ensure you comply with the upstream content’s terms.
