Compass Apple-2-Apple runner


Offline, apple-to-apple benchmark runner for FAIR Signposting (A2A) recipes using Compass.

This project is designed to answer one practical question:

Given a fixed set of “correct” and “faulty” FAIR Signposting examples, how well does the Compass validator detect the expected relations and issues?

It does that by:

  1. Harvesting HTTP response headers from the published A2A benchmark pages into offline fixtures (YAML files).
  2. Replaying those fixtures locally (no network needed for the benchmark run).
  3. Validating the discovered Web Links with Compass.
  4. Asserting the output against a curated set of expectations (relations + expected errors/warnings).
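
For orientation, a harvested fixture (the output of step 1) might look roughly like the sketch below. The exact schema is whatever the harvester writes; the field names here are illustrative, not authoritative.

# fixtures/http/<fixture-id>.yml (illustrative shape only, not the exact schema)
uri: https://example.org/a2a/scenario-01       # page the headers were harvested from
fetchedAt: 2024-01-01T00:00:00Z                # recorded fetch timestamp
headers:
  Link:
    - '<https://doi.org/10.1234/abcd>; rel="cite-as"'
    - '<https://example.org/metadata.xml>; rel="describedby"; type="application/xml"'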

What this runner is (and is not)

✅ It is

  • An integration benchmark for Compass on a known dataset (A2A FAIR Signposting scenarios).
  • Offline and reproducible once fixtures are harvested.
  • A way to detect regressions: if Compass behavior changes, the benchmark will fail and show where.

❌ It is not

  • A performance benchmark (speed/throughput). This runner measures correctness and behavior, not speed or throughput.
  • A crawler. During benchmark execution it does not fetch URLs; it only processes recorded HTTP headers.

Benchmark scope (important)

This repository currently benchmarks only the A2A HTTP scenarios, because that is Compass’ primary focus: validating declared Web Links (e.g. HTTP Link headers) in an offline manner.

Not included (by design):

  • Inline HTML signposting (e.g. <link ...> tags embedded in HTML). This is excluded for now because it would primarily benchmark the HTML-to-WebLink parsing step, not Compass itself; Compass validation happens on the WebLink model.
  • Any network-based behavior, such as:
    • dereferencing target URLs,
    • following Link Sets remotely,
    • HTTP content negotiation (conneg),
    • validating “does this URL resolve?” or checking response codes at runtime.

If you need those aspects, they belong in the consuming application (network client + parsers), not in Compass’ offline validator.


Repository layout (high level)

  • benchmarks/
    Input definitions for which A2A scenarios to harvest (scenario list YAML).

  • fixtures/
    Offline captured HTTP response headers per scenario (one YAML file per scenario).

  • src/main/java/...
    CLI utility to harvest fixtures.

  • src/test/groovy/...
    The benchmark runner implemented as tests (Spock), which replay fixtures and assert results.

  • src/test/resources/expectations.yaml
    The “ground truth”: what relations and issues are expected per fixture.


Prerequisites

  • Java 21
  • Maven 3.9+ (recommended)

Quick start: run the benchmark

From the project root:

mvn test

What happens:

  • All fixture files under fixtures/http/ are enumerated.
  • For each fixture, the recorded Link header values are parsed into Web Links.
  • Compass processes those Web Links.
  • The result is asserted against src/test/resources/expectations.yaml.

If everything matches expectations, the build is green.
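
For context on the parsing step: recorded Link header values follow RFC 8288 (Web Linking). A single header value such as the one below (URLs illustrative) yields two Web Links, with relation types cite-as and describedby, and those Web Links are what Compass validates.

Link: <https://doi.org/10.5281/zenodo.1234>; rel="cite-as", <https://example.org/metadata.ttl>; rel="describedby"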


(Optional) Regenerate / harvest fixtures

Fixtures are snapshots of HTTP response headers of the A2A benchmark pages. You only need to regenerate them if:

  • the upstream A2A pages changed,
  • you want to add new scenarios,
  • or you intentionally want to re-baseline the dataset.

1) Compile the project (so the harvester class is available)

mvn -DskipTests package

2) Run the harvester

java -cp target/classes life.qbic.compass.benchmark.A2AFixtureHarvester \
  --input benchmarks/a2a-scenario-uris.yaml \
  --out fixtures/http

To overwrite existing fixtures:

java -cp target/classes life.qbic.compass.benchmark.A2AFixtureHarvester \
  --input benchmarks/a2a-scenario-uris.yaml \
  --out fixtures/http \
  --force

Notes

  • Harvesting requires network access.
  • Benchmark execution (mvn test) does not.

How expectations work

The benchmark is “apple-to-apple” because results are compared to a fixed, scenario-specific baseline.

Each fixture is identified by its filename (without .yml/.yaml).
That ID must have a matching entry in:

  • src/test/resources/expectations.yaml

Expectations can express:

1) Relation cardinalities

Example: “there must be at least one describedby link”.

some-fixture-id: 
  relations: 
    describedby: { min: 1 } 
    cite-as: { min: 1, max: 1 }
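
Read min/max as bounds on how many links of that relation type must be discovered: { min: 1 } means at least one, and { min: 1, max: 1 } requires exactly one cite-as link.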

2) Coarse issue expectations (errors/warnings)

Example: “this scenario is faulty, so Compass must emit an error”.

some-fixture-id:
  expectErrors: true
  expectWarnings: false

3) Specific issues that must appear

Example: “an ERROR mentioning cite-as must be present”.

some-fixture-id: 
  mustContainIssues: 
    - severity: ERROR 
      messageContains: "cite-as"

Interpreting benchmark results

✅ Success (green build)

All fixtures matched their expectations:

  • relations were discovered with the expected minimum/maximum counts, and
  • Compass produced the expected issue types/messages.

This indicates Compass behavior is consistent with the current baseline.

❌ Failure (red build)

A failure means at least one scenario deviated from expectations. This is the key “benchmark signal”.

Typical failure modes and what they mean:

1) Relation cardinality mismatch

You may see an assertion like:

  • expected min=1 but was 0

Interpretation:

  • Compass (or link parsing) did not discover a relation that the baseline expects.
  • This could be a regression, a parsing change, or a fixture change.

2) Unexpected error/warning presence

You may see a message like:

  • expectErrors=false but was true

Interpretation:

  • Compass started reporting an error for a scenario that is considered “valid enough” by the baseline, or stopped reporting an error where one is expected.

This can be:

  • a Compass ruleset change,
  • a bugfix (baseline may need updating),
  • or a change in how issues are classified.
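
When the change is intended, re-baselining is usually a small, deliberate edit in src/test/resources/expectations.yaml, e.g. flipping expectErrors: false to expectErrors: true for the affected fixture ID once the newly reported error has been confirmed as correct.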

3) Missing expected issue message

You may see:

  • “Missing expected issue … messageContains=…”

Interpretation:

  • Compass did not emit the specific diagnostic the benchmark expects.
  • Often indicates a change in rule triggering or message wording.

What to do when a test fails

  1. Identify the failing fixture ID from the test output (the benchmark runs once per fixture).
  2. Open the corresponding fixture in fixtures/http/<fixture-id>.yml and inspect the recorded Link headers.
  3. Check src/test/resources/expectations.yaml for what was expected.
  4. Decide which side changed:
    • Compass changed (intended improvement/regression): update expectations (carefully).
    • Fixture changed (upstream changed): re-harvest fixtures and adjust expectations if needed.
    • Parsing changed: ensure Link parsing still matches the intended RFC behavior.
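
While iterating on a fix, it can be faster to run only the benchmark spec instead of the whole test phase. Assuming standard Maven Surefire behavior, something like the following narrows the run (the class name is a placeholder; substitute the actual spec class from src/test/groovy):

mvn -Dtest=A2ABenchmarkSpec test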

Adding a new scenario

  1. Add the scenario to benchmarks/a2a-scenario-uris.yaml (see the sketch after this list).
  2. Harvest fixtures (see above).
  3. Add a matching entry to src/test/resources/expectations.yaml.
  4. Run:
mvn test
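
The exact schema of benchmarks/a2a-scenario-uris.yaml is defined by the harvester; conceptually, each entry pairs a fixture ID with the URI to harvest, along these lines (illustrative):

scenarios:
  - id: my-new-scenario       # becomes fixtures/http/my-new-scenario.yml and the expectations key
    uri: https://example.org/a2a/my-new-scenario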

Reproducibility notes

  • Fixture harvesting is time-dependent (it records a fetch timestamp) and network-dependent (server behavior may change).
  • Once fixtures are committed/kept stable, benchmark runs are deterministic and offline.

License / attribution

This repository benchmarks Compass using the publicly available A2A FAIR Signposting scenarios.
If you redistribute fixtures or scenario definitions, ensure you comply with the upstream content’s terms.
