Offline, apples-to-apples benchmark runner for FAIR Signposting (A2A) recipes using Compass.
This project is designed to answer one practical question:
Given a fixed set of “correct” and “faulty” FAIR Signposting examples, how well does the Compass validator detect the expected relations and issues?
It does that by:
- Harvesting HTTP response headers from the published A2A benchmark pages into offline fixtures (YAML files).
- Replaying those fixtures locally (no network needed for the benchmark run).
- Validating the discovered Web Links with Compass.
- Asserting the output against a curated set of expectations (relations + expected errors/warnings).
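In practice this is a two-step workflow (both commands are described in detail further down in this README):

```
# One-time, online: capture fixtures from the published A2A pages
mvn -DskipTests package
java -cp target/classes life.qbic.compass.benchmark.A2AFixtureHarvester --input benchmarks/a2a-scenario-uris.yaml --out fixtures/http

# Repeatable, offline: run the benchmark against the fixtures
mvn test
```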
What this is:
- An integration benchmark for Compass on a known dataset (A2A FAIR Signposting scenarios).
- Offline and reproducible once fixtures are harvested.
- A regression detector: if Compass behavior changes, the benchmark will fail and show where.

What this is not:
- A performance benchmark (speed/throughput); the focus is correctness/behavior benchmarking.
- A crawler: during benchmark execution it does not fetch URLs, it only processes recorded HTTP headers.
This repository currently benchmarks only the A2A HTTP scenarios, because that is Compass’ primary focus: validating declared Web Links (e.g. HTTP Link headers) in an offline manner.
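As a concrete example, a FAIR Signposting Link header on a landing page might look like this (URLs are illustrative):

```
Link: <https://example.org/metadata.json>; rel="describedby"; type="application/ld+json",
      <https://doi.org/10.1234/example>; rel="cite-as"
```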
Not included (by design):
- Inline HTML signposting (e.g. `<link ...>` tags embedded in HTML). We did not include this yet because it would primarily benchmark the HTML-to-WebLink parsing step, not Compass itself; Compass validation happens on the WebLink model.
- Any network-based behavior, such as:
  - dereferencing target URLs,
  - following Link Sets remotely,
  - HTTP content negotiation (conneg),
  - validating “does this URL resolve?” or checking response codes at runtime.
If you need those aspects, they belong in the consuming application (network client + parsers), not in Compass’ offline validator.
Repository layout:
- `benchmarks/`: input definitions for which A2A scenarios to harvest (scenario list YAML).
- `fixtures/`: offline captured HTTP response headers per scenario (one YAML file per scenario).
- `src/main/java/...`: CLI utility to harvest fixtures.
- `src/test/groovy/...`: the benchmark runner, implemented as Spock tests that replay fixtures and assert results.
- `src/test/resources/expectations.yaml`: the “ground truth”: what relations and issues are expected per fixture.
Requirements:
- Java 21
- Maven 3.9+ (recommended)
From the project root:

```
mvn test
```

What happens:
- All fixture files under `fixtures/http/` are enumerated.
- For each fixture, the recorded `Link` header values are parsed into Web Links.
- Compass processes those Web Links.
- The result is asserted against `src/test/resources/expectations.yaml`.
If everything matches expectations, the build is green.
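For orientation, a captured fixture might look roughly like the sketch below; the field names are an assumption of this sketch, so consult a real file under `fixtures/http/` for the actual schema:

```yaml
# fixtures/http/some-fixture-id.yml (illustrative shape, not the exact schema)
uri: https://example.org/a2a/some-scenario
fetchedAt: "2024-05-01T12:00:00Z"
headers:
  Link:
    - '<https://example.org/metadata.json>; rel="describedby"; type="application/ld+json"'
    - '<https://doi.org/10.1234/example>; rel="cite-as"'
```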
Fixtures are snapshots of the HTTP response headers returned by the A2A benchmark pages. You only need to regenerate them if:
- the upstream A2A pages changed,
- you want to add new scenarios,
- or you intentionally want to re-baseline the dataset.
To harvest fixtures:

```
mvn -DskipTests package
java -cp target/classes life.qbic.compass.benchmark.A2AFixtureHarvester \
  --input benchmarks/a2a-scenario-uris.yaml \
  --out fixtures/http
```

To overwrite existing fixtures, add `--force`:

```
java -cp target/classes life.qbic.compass.benchmark.A2AFixtureHarvester \
  --input benchmarks/a2a-scenario-uris.yaml \
  --out fixtures/http \
  --force
```

Notes:
- Harvesting requires network access.
- Benchmark execution (`mvn test`) does not.
The benchmark is “apples-to-apples” because results are compared to a fixed, scenario-specific baseline.
Each fixture is identified by its filename (without `.yml`/`.yaml`).
That ID must have a matching entry in `src/test/resources/expectations.yaml`.
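For example, a fixture stored at `fixtures/http/landing-page-pid.yml` (a hypothetical name) would be matched by the top-level key:

```yaml
landing-page-pid:
  relations:
    cite-as: { min: 1 }
```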
Expectations can express relation count constraints (per-relation `min`/`max`), expected error/warning flags, and required specific issues (a severity plus a message substring).
Example: “there must be at least one describedby link”.

```yaml
some-fixture-id:
  relations:
    describedby: { min: 1 }
    cite-as: { min: 1, max: 1 }
```

Example: “this scenario is faulty, Compass must emit an error”.

```yaml
some-fixture-id:
  expectErrors: true
  expectWarnings: false
```

Example: “an ERROR mentioning cite-as must be present”.

```yaml
some-fixture-id:
  mustContainIssues:
    - severity: ERROR
      messageContains: "cite-as"
```
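Assuming the three kinds of constraints compose within a single entry (an assumption of this sketch, not stated explicitly above), a fuller entry could look like:

```yaml
# Hypothetical entry combining relation counts, error flags, and required issues.
faulty-cite-as-scenario:
  relations:
    describedby: { min: 1 }
  expectErrors: true
  mustContainIssues:
    - severity: ERROR
      messageContains: "cite-as"
```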
A green build means all fixtures matched their expectations:
- relations were discovered with the expected minimum/maximum counts, and
- Compass produced the expected issue types/messages.

This indicates Compass behavior is consistent with the current baseline.
A failure means at least one scenario deviated from expectations. This is the key “benchmark signal”.
Typical failure modes and what they mean:
You may see an assertion like:
- expected `min=1` but was `0`
Interpretation:
- Compass (or link parsing) did not discover a relation that the baseline expects.
- This could be a regression, a parsing change, or a fixture change.
You may see a message like:
- `expectErrors=false` but was `true`
Interpretation:
- Compass started reporting an error for a scenario that is considered “valid enough” by the baseline, or stopped reporting an error where one is expected.
This can be:
- a Compass ruleset change,
- a bugfix (baseline may need updating),
- or a change in how issues are classified.
You may see:
- “Missing expected issue … messageContains=…”
Interpretation:
- Compass did not emit the specific diagnostic the benchmark expects.
- Often indicates a change in rule triggering or message wording.
To debug a failure:
- Identify the failing fixture ID from the test output (the benchmark runs once per fixture).
- Open the corresponding fixture in `fixtures/http/<fixture-id>.yml` and inspect the recorded `Link` headers.
- Check `src/test/resources/expectations.yaml` for what was expected.
- Decide which side changed:
  - Compass changed (intended improvement/regression): update expectations (carefully).
  - Fixture changed (upstream changed): re-harvest fixtures and adjust expectations if needed.
  - Parsing changed: ensure Link parsing still matches the intended RFC behavior.
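While iterating on a single scenario, it can help to rerun only the benchmark spec via Surefire’s test filter; the class name below is hypothetical, so substitute the actual Spock class under `src/test/groovy/`:

```
mvn test -Dtest=A2ABenchmarkSpec
```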
Adding a new scenario:
- Add the scenario to `benchmarks/a2a-scenario-uris.yaml` (a hypothetical sketch of this file follows the notes below).
- Harvest fixtures (see above).
- Add a matching entry to `src/test/resources/expectations.yaml`.
- Run `mvn test`.

Notes:
- Fixture harvesting is time-dependent (it records a fetch timestamp) and network-dependent (server behavior may change).
- Once fixtures are committed/kept stable, benchmark runs are deterministic and offline.
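A hypothetical sketch of the scenario list; the real structure of `benchmarks/a2a-scenario-uris.yaml` may differ, so treat the field names as assumptions:

```yaml
# benchmarks/a2a-scenario-uris.yaml (illustrative shape)
scenarios:
  - id: some-fixture-id
    uri: https://example.org/a2a/some-scenario
  - id: faulty-cite-as-scenario
    uri: https://example.org/a2a/faulty-cite-as
```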
This repository benchmarks Compass using the publicly available A2A FAIR Signposting scenarios.
If you redistribute fixtures or scenario definitions, ensure you comply with the upstream content’s terms.