This is a research project to evaluate how well various LLM models generate accessible HTML content.
LLMs currently generate code with accessibility bugs, creating blockers for people with disabilities and leading to costly rework and fixes downstream.
The goal is to create a public test suite that can be used to benchmark how well various LLMs generate accessible HTML code. Eventually, it could also be used to help train models to generate more accessible code by default.
- Each test case contains a prompt to generate an HTML page to demonstrate a specific pattern or component.
- This page is rendered in a real browser using Playwright (Chromium). Tests are executed against this rendered page.
- The HTML is evaluated against axe-core, one of the most popular automated accessibility testing engines.
- The HTML is also evaluated against a manually defined set of assertions, customized for the specific test case. This allows for more robust testing than just using axe-core.
- Tests only pass if zero axe-core failures are found AND all requirement assertions pass. Best Practice (BP) assertion failures do not fail the test but are tracked separately.
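To make the pass criterion concrete, here is a minimal Python sketch of how a single generation's verdict could be derived from the axe-core results and assertion records; the function and field names are illustrative, not the project's actual code.

```python
def generation_passes(axe_violations: list, assertions: list[dict]) -> bool:
    """Illustrative pass criterion: zero axe-core violations AND every
    Requirement ('R') assertion passing. Best Practice ('BP') assertions
    are tracked separately and never affect the verdict."""
    requirements_ok = all(
        a["status"] == "pass"
        for a in assertions
        if a.get("type", "R") == "R"  # assertions default to Requirement
    )
    return len(axe_violations) == 0 and requirements_ok
```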
- Python orchestrator (generation, execution, reporting)
- Node.js Playwright + axe-core evaluation
- Per-test prompts & injected JS assertions
- HTML report summarizing performance
- Token + cost tracking (tokens in/out/total, per-generation cost, aggregated per model)
- Multi-sample generation with pass@k metrics (the probability that at least one of k sampled generations passes)
You can request multiple independent generations ("samples") per (test, model) pair. This enables computing pass@k metrics similar to those used in code evaluation benchmarks.
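pass@k metrics of this kind are typically computed with the unbiased estimator popularized by code-generation benchmarks; a minimal sketch, assuming `n` collected samples of which `c` passed (the repository's own implementation may differ in detail):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from n generations (c of which passed) is a pass."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples with 5 passes gives pass@1 = 0.25 and pass@5 ≈ 0.81.
```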
Step 1: Send prompts to the LLMs and generate HTML

```bash
python -m a11y_llm_tests.cli run \
  --models-file config/models.yaml \
  --out runs \
  --samples 20
```

Step 2: Run the eval and generate the report

```bash
python -m a11y_llm_tests.cli evaluate \
  <path to run directory> \
  --k 1,5,10
```

Artifacts:
- Each sample's HTML: `runs/<ts>/raw/<test>/<model>__s<idx>.html` (single-sample keeps the legacy `<model>.html`)
- Screenshots with analogous naming
- `results.json` now includes per-sample records plus an `aggregates` array with pass@k stats.
- Report includes an aggregate pass@k table and grouped per-sample cards.
Tips:
- Increase `temperature` (or other diversity params) to reduce sample correlation.
- Use `--disable-cache` if you want fresh generations even when prompt/model/seed repeat.
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

# Node deps
bash scripts/install_node_deps.sh

# Copy env and set keys
cp .env.example .env
export OPENAI_API_KEY=...  # etc. or put in .env and use dotenv

# Copy model config and set API keys
cp config/models.yaml.example config/models.yaml

# Run all tests against configured models
python -m a11y_llm_tests.cli run --models-file config/models.yaml --out runs
```

To add a new test case, create a new folder under `test_cases/`:
```
test_cases/
  form-labels/
    prompt.md
    test.js
    example-fail/
    example-pass/
```
`prompt.md` contains ONLY the user-facing instruction for the model.
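As an illustration only (not a prompt shipped with this repo), a `form-labels/prompt.md` might contain nothing more than:

```
Create a complete HTML page with a contact form that collects a name,
an email address, and a message, and includes a submit button.
```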
`test.js` must export:
```js
module.exports.run = async ({ page, assert }) => {
  await assert("Has an h1", async () => {
    const count = await page.$$eval('h1', els => els.length);
    return count >= 1; // truthy => pass, falsy => fail
  });

  await assert("Sequential heading levels", async () => {
    // Return object form to include custom message
    const ok = await page.$$eval('h1 + h2', els => els.length) > 0;
    return { pass: ok, message: ok ? undefined : 'h2 does not follow h1' };
  }, { type: 'BP' });

  return {}; // assertions collected automatically
};
```

The runner injects an `assert(name, fn, opts?)` helper:
| Parameter | Description |
|---|---|
| `name` | Human-readable assertion label |
| `fn` | Async/sync function returning boolean OR `{ pass, message? }` |
| `opts.type` | `'R'` (Requirement, default) or `'BP'` (Best Practice) |
The return shape from `run` can be empty.

Each assertion may now include a `type` field:
| Type | Meaning | Affects Test Pass/Fail | Aggregated Separately |
|---|---|---|---|
| `R` | Requirement (default) | Yes (any failing R => test fails) | Requirement Pass Rate |
| `BP` | Best Practice | No (ignored for pass/fail) | Best Practice Pass Rate |
If `type` is omitted, it defaults to `R` for backward compatibility. The HTML report shows both Requirement Pass Rate (the percentage of tests whose requirement assertions all passed) and Best Practice Pass Rate (the percentage of tests containing BP assertions where all of those BP assertions passed).
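A minimal sketch of how these two rates could be computed from per-test assertion records; the function name and data shapes are illustrative, not the report generator's actual code.

```python
def pass_rates(test_results: list[dict]) -> tuple[float, float]:
    """Requirement Pass Rate: share of tests whose 'R' assertions all passed.
    Best Practice Pass Rate: share of tests containing 'BP' assertions
    whose 'BP' assertions all passed."""
    req_passed, bp_passed, bp_total = 0, 0, 0
    for result in test_results:
        assertions = result["assertions"]
        reqs = [a for a in assertions if a.get("type", "R") == "R"]
        bps = [a for a in assertions if a.get("type") == "BP"]
        if all(a["status"] == "pass" for a in reqs):
            req_passed += 1
        if bps:
            bp_total += 1
            if all(a["status"] == "pass" for a in bps):
                bp_passed += 1
    requirement_rate = req_passed / len(test_results) if test_results else 0.0
    best_practice_rate = bp_passed / bp_total if bp_total else 0.0
    return requirement_rate, best_practice_rate
```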
Example assertion objects returned from `run`:

```js
return {
  assertions: [
    { name: 'has main landmark', status: 'pass', type: 'R' },
    { name: 'images have alt text', status: 'fail', type: 'BP', message: '1 of 5 images missing alt' }
  ]
};
```

The HTML report is generated at `runs/<timestamp>/report.html` with:
- Summary stats per model
- Detailed per model/test breakdown
- Axe violations
- Assertions & statuses
- Pass@k aggregate table and per-sample cards when multiple samples are collected
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.