Added RAI metadata support with PROV-O, using a YAML interface. by JoanGi · Pull Request #34 · MIT-LCP/croissant-baker

JoanGi · 2026-03-25T16:28:34Z

Add RAI metadata extension with PROV-O provenance support

This PR aims to initiate discussion on how we can add Croissant RAI metadata support to croissant-maker and, more broadly, to the set of tools in this ecosystem (such as the editor). Let me know what you think!

Motivation

RAI metadata presents a unique challenge compared to the rest of the Croissant spec:

Most attributes live at the dataset level and cannot be automatically extrapolated from the data structure — fields like rai:SocialImpact or rai:Biases require deliberate authorship.
Many natural-language attributes require critical thinking by dataset authors; some of the seed works the RAI spec is inspired by reflect this human judgement.
Newer mechanisms, such as PROV-O provenance, allow us to organise this information in a more machine-readable way — and tools should support it. Note that the last published release of mlcroissant (1.0.22) does not yet include PROV-O support; this feature requires installing from the latest commit.

Approach

This PR addresses the above by proposing a YAML interface that lets authors create their RAI annotations in a human-friendly format, while croissant-maker handles the translation into spec-compliant, machine-actionable Croissant JSON-LD.

To facilitate the user's journey, a template is provided at the root of the project (please feel free to change it), and an example is provided at tests/data/input/mimiciv_demo/physionet.org/mimiciv_demo-rai-example.yaml
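As a rough illustration of the translation step described above, the following minimal sketch maps friendly keys (as they might appear in a parsed YAML file) to prefixed RAI properties. The friendly key names, the `KEY_TO_RAI` table, and the `to_rai_properties` helper are hypothetical and are not taken from this PR; the `rai:` property names are assumed from the published RAI vocabulary, and the PR's actual mapping may differ.

```python
import json

# Hypothetical mapping from friendly YAML keys to RAI JSON-LD properties.
# The property names are assumptions, not the PR's actual table.
KEY_TO_RAI = {
    "social_impact": "rai:dataSocialImpact",
    "biases": "rai:dataBiases",
    "limitations": "rai:dataLimitations",
    "use_cases": "rai:dataUseCases",
}


def to_rai_properties(config: dict) -> dict:
    """Translate friendly keys into prefixed RAI properties."""
    unknown = sorted(set(config) - set(KEY_TO_RAI))
    if unknown:
        # Fail loudly so a misspelt key is not silently dropped.
        raise KeyError(f"unsupported RAI keys: {unknown}")
    # Skip empty values so the output only carries authored statements.
    return {KEY_TO_RAI[key]: value for key, value in config.items() if value}


# Stand-in for the dictionary a YAML parser would return.
parsed_yaml = {
    "social_impact": "Supports reproducible clinical research.",
    "biases": "Single-centre cohort.",
    "limitations": "",
}
print(json.dumps(to_rai_properties(parsed_yaml), indent=2))
```

Rejecting unknown keys rather than ignoring them is a design choice in this sketch: for fields that depend on human judgement, a silently dropped statement is arguably worse than a failed run.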

How to use it

When generating a new Croissant file, pass --rai-config:

croissant-maker --input ./my-dataset --creator "Name" --rai-config rai.yaml

When enriching an existing Croissant file, use the new rai-apply subcommand:

croissant-maker rai-apply dataset.jsonld --rai-config rai.yaml

or write to a separate file:

croissant-maker rai-apply dataset.jsonld --rai-config rai.yaml --output dataset-rai.jsonld

Examples and templates

rai-example.yaml — fully documented template with every supported field and inline comments explaining the RAI/PROV-O mapping
tests/data/input/mimiciv_demo/physionet.org/mimiciv_demo-rai-example.yaml — real-world example using the MIMIC-IV demo dataset
tests/data/output/mimiciv_demo_croissant_rai.jsonld — reference output showing the resulting Croissant JSON-LD

Changes

src/croissant_maker/rai/ — new module with schema (Pydantic models), YAML loader, and injector
src/croissant_maker/__main__.py — adds --rai-config flag to the generate command and the rai-apply subcommand
tests/test_rai.py — integration test that runs the full pipeline against the MIMIC-IV demo and validates output against the reference file
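To make the role of the schema layer more concrete, here is a minimal sketch of validating a parsed configuration before it reaches the injector. The PR uses Pydantic models; this sketch swaps in a standard-library dataclass so that it stays self-contained. The class name `RAIConfigSketch`, its fields, and `load_rai_config_sketch` are hypothetical and do not reflect the PR's actual `RAIConfig` or `load_rai_config`.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RAIConfigSketch:
    """Stand-in for a schema model; the field names are hypothetical."""

    social_impact: str = ""
    biases: list = field(default_factory=list)

    def __post_init__(self):
        # Minimal type checks, playing the role Pydantic validation would play.
        if not isinstance(self.social_impact, str):
            raise TypeError("social_impact must be a string")
        if not isinstance(self.biases, list) or not all(
            isinstance(item, str) for item in self.biases
        ):
            raise TypeError("biases must be a list of strings")


def load_rai_config_sketch(raw: dict) -> RAIConfigSketch:
    """Build a validated config from an already parsed mapping."""
    allowed = {f.name for f in fields(RAIConfigSketch)}
    unknown = sorted(set(raw) - allowed)
    if unknown:
        raise ValueError(f"unknown RAI fields: {unknown}")
    return RAIConfigSketch(**raw)


config = load_rai_config_sketch(
    {"social_impact": "Supports open research.", "biases": ["Single-centre cohort."]}
)
print(config)
```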

Broader discussion

The YAML interface is intentionally designed as a decoupled layer: if other tools — such as form-based editors, Croissant Miner outputs, or a future web-based form — can produce this YAML, then croissant-maker can act as the backend that maps it to the spec in the correct form. This opens a few paths worth discussing:
Form-based tools: a web form (conversations on this are ongoing) could allow users to fill in RAI fields interactively, exporting a YAML that croissant-maker then processes.
Croissant Miner integration: if Miner can emit a partial YAML from what it can infer, authors only need to fill in the fields that require human judgement.
Editor integration: the rai-apply subcommand means RAI metadata can be added to an existing Croissant file without regenerating it, which suits an editor workflow naturally.
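The editor-style workflow above, in which RAI metadata is added to an existing Croissant file without regenerating it, could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's injector: the function name `apply_rai`, the namespace constant, and the property names in the example are assumptions.

```python
import copy
import json

# Assumed namespace IRI for the rai: prefix; verify against the RAI spec.
RAI_NAMESPACE = "http://mlcommons.org/croissant/RAI/"


def apply_rai(croissant: dict, rai_properties: dict) -> dict:
    """Return a copy of the document with RAI properties merged in."""
    enriched = copy.deepcopy(croissant)  # leave the caller's document untouched
    context = enriched.setdefault("@context", {})
    if not isinstance(context, dict):
        # This sketch only handles the object form of @context.
        raise TypeError("expected @context to be a JSON object")
    # Declare the prefix only if the document does not already define it.
    context.setdefault("rai", RAI_NAMESPACE)
    enriched.update(rai_properties)
    return enriched


existing = {
    "@context": {"sc": "https://schema.org/"},
    "@type": "sc:Dataset",
    "name": "demo",
}
enriched = apply_rai(existing, {"rai:dataBiases": "Single-centre cohort."})
print(json.dumps(enriched, indent=2))
```

Because the function works on a copy, the choice between overwriting the input file and writing to a separate output becomes purely a question of where the result is serialised.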

slobentanzer · 2026-03-25T17:02:17Z

I have been thinking about the more manual / semantic tasks (there are some indicated as not tackled in my Parquet extension / Open Targets PR). The most attractive option at the moment seems to be to expose very clearly and unambiguously those parts that require expert input. That would in turn allow deferring them to an LLM-assisted system (importantly, outside of croissant-maker). A YAML config is a good low-threshold starting point.

From the schema perspective, it's not fully clear to me why rai should receive "special treatment"; but that may also be my own ignorance. Wouldn't it be better to support all present and future attributes generically? @JoanGi maybe I am missing something about the tech background that makes rai terms different. Naively, I would have expected that the most important aspect is that the terms are clearly described; then it shouldn't make a difference if it's rai, cr, or any other vocabulary.

If we start with special treatment flags, we may have to add many flags subsequently for any other extensions. But could also be that generalising this is premature optimisation. Happy to discuss.

tompollard · 2026-03-27T20:15:35Z

Thanks @JoanGi! Sorry for the hassle, but please could you rebase on main when you have an opportunity, and also update to account for the switch from poetry to uv?

tompollard

Some minor style issues at: https://github.com/MIT-LCP/croissant-maker/actions/runs/23552112223/job/68578859385

tompollard · 2026-03-27T20:16:45Z

src/croissant_maker/rai/__init__.py

+from croissant_maker.rai.loader import load_rai_config
+from croissant_maker.rai.schema import RAIConfig
+
+__all__ = ["load_rai_config", "inject_rai", "RAIConfig"]


please could you double check that all files are ending with a newline?
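One way to run the check requested above is a small script such as the following sketch. The helper names and the default file patterns are illustrative assumptions; the project's own linting setup may already cover this.

```python
from pathlib import Path


def ends_with_newline(data: bytes) -> bool:
    """Empty files are treated as fine; otherwise require a final newline."""
    return data == b"" or data.endswith(b"\n")


def files_missing_newline(root: Path, patterns=("*.py", "*.yaml", "*.toml")) -> list:
    """List files under root whose last byte is not a newline."""
    missing = []
    for pattern in patterns:
        for path in sorted(root.rglob(pattern)):
            if path.is_file() and not ends_with_newline(path.read_bytes()):
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in files_missing_newline(Path(".")):
        print(path)
```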

JoanGi · 2026-03-31T09:37:22Z

Until now, the task force has treated the RAI specification as an extension, because its contents mature over time and across stakeholders (for example, new data documentation frameworks or dimensions, such as concerns about synthetic data, may appear in the future). Some of the mechanisms (such as provenance and data-use conditions) have been integrated into the main spec in version 1.1.

JoanGi · 2026-03-31T09:38:44Z

Test passed. A brief summary of the last commits:

Rebased on upstream/main
Resolved pyproject.toml conflict: kept upstream's style, added pyyaml>=6.0 for RAI
Added trailing newlines to 5 files that were missing them
Removed poetry.lock from the commit (upstream uses uv.lock)
Renamed rai-teamplate.yaml → rai-example.yaml (typo fix)

rafiattrach · 2026-04-01T19:50:47Z

Thank you @JoanGi for this PR!

- about the generic extension mechanism question

So the concern is instead of a dedicated --rai-config flag, should we have a generic --extension-config flag that works for any future Croissant extension? It is a fair question for the long run, but RAI has a strong case for being treated explicitly. It has its own official namespace (rai:), its own MLCommons working group, and a direct regulatory driver: EU AI Act Articles 10 and 53 require structured dataset documentation for high-risk AI systems. That is a different weight than a hypothetical future extension. Building a generic registry and per-extension validation layer before we have a second concrete extension to generalise from would be speculative. I would suggest we open a follow-up issue to track that direction rather than blocking here.

- the --no-validate in tests

The Croissant 1.1 spec is out and PROV-O is part of it, but the mlcroissant Python library has not caught up yet, so validation would report false positives on correct output. The test flags this explicitly and the workaround is temporary.

On the rebase

@JoanGi, could you confirm CI is green on the latest commits? Happy to help rebase onto current main if useful, just let me know. Once CI is confirmed I think this could be ready.

Also this is purely additive. No existing behaviour changes. Everything new is opt-in via --rai-config or the rai-apply subcommand.

@tompollard @slobentanzer What do you think?

slobentanzer · 2026-04-01T20:04:53Z

@rafiattrach follow your reasoning and fully agree; it's probably easier to keep a lookout and respond to changes with new flags or generalisations as they become relevant. From a user side, this means it should be made obvious that this flag exists and should particularly be used if RAI is a concern in the data to be annotated. Positioning in the docs and maybe an info logging message might be enough.
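The info-level message suggested above could be as small as the following sketch. The logger name, the helper `note_missing_rai`, and the message wording are hypothetical; only the --rai-config flag and the rai-apply subcommand come from the PR description.

```python
import logging

logger = logging.getLogger("croissant_maker_sketch")  # hypothetical logger name


def note_missing_rai(rai_config_path) -> bool:
    """Log a hint when no RAI config was passed; return True if the hint fired."""
    if rai_config_path is None:
        logger.info(
            "No --rai-config given: RAI metadata will not be added. "
            "It can be added later with the rai-apply subcommand."
        )
        return True
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    note_missing_rai(None)
```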

JoanGi · 2026-04-02T09:39:56Z

Whether RAI is "another" extension or a central part of the spec has also been under discussion in the Croissant community for a while. The idea is that once RAI aspects become mature enough, they would be added to the general spec, as they may apply to a wide variety of use cases. As this is still an open debate, I agree with @rafiattrach that we should open a new issue and keep the discussion there.

A new release supporting PROV-O is coming in the next few days! We can wait until it is available to avoid unnecessary work.

tompollard requested changes Mar 27, 2026

Added RAI metadata support with PROV-O, using a YAML interface.

ed12a93

JoanGi force-pushed the feat/rai-metadata-extension branch from c583929 to ed12a93 Compare March 31, 2026 09:27

passed uv

8770316

JoanGi wants to merge 2 commits into MIT-LCP:main from JoanGi:feat/rai-metadata-extension
