Added RAI metadata support with PROV-O, using a YAML interface. #34
JoanGi wants to merge 2 commits into MIT-LCP:main
I have been thinking about the more manual / semantic tasks (some are indicated as not tackled in my Parquet extension / Open Targets PR). The most attractive option at the moment seems to be to expose very clearly and unambiguously those parts that require expert input. That would in turn allow deferring their use to an LLM-assisted system (importantly, outside of croissant-maker). A YAML config is a good low-threshold starting point. From the schema perspective, it's not fully clear to me why If we start with special treatment flags, we may have to add many flags subsequently for any other extensions. But it could also be that generalising this is premature optimisation. Happy to discuss.
Thanks @JoanGi! Sorry for the hassle, but please could you rebase on main when you have an opportunity, and also update to account for the switch from poetry to uv?
tompollard left a comment
Some minor style issues at: https://github.com/MIT-LCP/croissant-maker/actions/runs/23552112223/job/68578859385
src/croissant_maker/rai/__init__.py
Outdated
from croissant_maker.rai.loader import load_rai_config
from croissant_maker.rai.schema import RAIConfig


__all__ = ["load_rai_config", "inject_rai", "RAIConfig"]
\ No newline at end of file
Please could you double-check that all files end with a newline?
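One way to run that check locally is sketched below. It is a minimal, standard-library-only helper and not part of croissant-maker or its CI; the function name `files_missing_final_newline` and the default `*.py` pattern are assumptions made for illustration.

```python
from pathlib import Path


def files_missing_final_newline(root: str, pattern: str = "*.py") -> list:
    """Return files under `root` matching `pattern` that lack a trailing newline.

    Hypothetical helper for illustration; not part of croissant-maker.
    """
    missing = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        data = path.read_bytes()
        # Empty files are accepted; non-empty files must end with b"\n".
        if data and not data.endswith(b"\n"):
            missing.append(path)
    return missing
```

Calling `files_missing_final_newline("src")` from the repository root would list the offending files, for example `src/croissant_maker/rai/__init__.py` as flagged in the diff above.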
c583929 to ed12a93
Until now, the task force has treated the RAI specification as an extension, because its contents mature over time and across stakeholders (for example, might new data documentation frameworks or dimensions appear in the future, such as concerns about synthetic data?). Some of the mechanisms (such as provenance and data-use conditions) have been integrated into the main spec in version 1.1.
Test passed. A brief summary of the last commits:
Thank you @JoanGi for this PR!
- about the generic extension mechanism question So the concern is instead of a dedicated
- the The Croissant 1.1 spec is out and PROV-O is part of it, but the
On the rebase @JoanGi, could you confirm CI is green on the latest commits? Happy to help rebase onto current
Also this is purely additive. No existing behaviour changes. Everything new is opt-in via
@tompollard @slobentanzer What do you think?
@rafiattrach I follow your reasoning and fully agree; it's probably easier to keep a lookout and respond to changes with new flags or generalisations as they become relevant. From the user side, this means it should be made obvious that this flag exists and that it should be used in particular if RAI is a concern for the data being annotated. Positioning in the docs and maybe an info-level logging message might be enough.
The debate over whether RAI is "another" extension or a central part of the spec has also been going on in the Croissant community for a while. The idea is that when RAI aspects become mature enough, they will be added to the general spec, since they may apply to a wide variety of use cases. As this is still an open debate, I agree with @rafiattrach that we should open a new issue and keep the discussion there. We are going to have a new release supporting PROV-O in the coming days! We can wait until we have it to avoid unnecessary work.
Add RAI metadata extension with PROV-O provenance support
This PR aims to initiate discussion on how we can add Croissant RAI metadata support to croissant-maker and, more broadly, to the set of tools in this ecosystem (such as the editor). Let me know what you think!
Motivation
RAI metadata presents a unique challenge compared to the rest of the Croissant spec:
- rai:SocialImpact or rai:Biases require deliberate authorship.
- mlcroissant (1.0.22) does not yet include PROV-O support; this feature requires installing from the latest commit.
Approach
This PR addresses the above by proposing a YAML interface that lets authors create their RAI annotations in a human-friendly format, while croissant-maker handles the translation into spec-compliant, machine-actionable Croissant JSON-LD.
To facilitate the user's journey, a template is provided at the root of the project (please change it as you see fit), and an example is provided at tests/data/input/mimiciv_demo/physionet.org/mimiciv_demo-rai-example.yaml
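To make the translation step concrete, here is a minimal sketch of the idea. It is not the code in this PR. It assumes the YAML file has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`), and the YAML keys, the `FIELD_MAP` table, the function name `inject_rai_sketch`, the RAI namespace URL, and the `rai:dataBiases` / `rai:dataSocialImpact` property names are all assumptions for illustration that may differ from what the actual injector emits.

```python
import copy

# Assumed namespace for the Croissant RAI vocabulary.
RAI_NAMESPACE = "http://mlcommons.org/croissant/RAI/"

# Hypothetical mapping from human-friendly YAML keys to RAI property names.
FIELD_MAP = {
    "data_biases": "rai:dataBiases",
    "data_social_impact": "rai:dataSocialImpact",
}


def inject_rai_sketch(croissant: dict, rai_config: dict) -> dict:
    """Return a copy of `croissant` with the RAI annotations added.

    `rai_config` stands in for the parsed YAML file; unknown keys are
    rejected so that typos do not silently disappear.
    """
    unknown = set(rai_config) - set(FIELD_MAP)
    if unknown:
        raise ValueError(f"Unsupported RAI keys: {sorted(unknown)}")

    result = copy.deepcopy(croissant)  # leave the caller's document untouched
    context = result.setdefault("@context", {})
    if not isinstance(context, dict):
        raise TypeError("expected '@context' to be a JSON object")
    context.setdefault("rai", RAI_NAMESPACE)  # declare the prefix once

    for yaml_key, rai_property in FIELD_MAP.items():
        value = rai_config.get(yaml_key)
        if value:  # skip missing or empty annotations
            result[rai_property] = value
    return result
```

Returning a copy rather than mutating the input keeps the same function usable both when generating a new Croissant file and when enriching an existing one.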
How to use it
When generating a new Croissant file, pass --rai-config:
croissant-maker --input ./my-dataset --creator "Name" --rai-config rai.yaml
When enriching an existing Croissant file, use the new rai-apply subcommand:
or write to a separate file:
Examples and templates
- rai-example.yaml: fully documented template with every supported field and inline comments explaining the RAI/PROV-O mapping
- tests/data/input/mimiciv_demo/physionet.org/mimiciv_demo-rai-example.yaml: real-world example using the MIMIC-IV demo dataset
- tests/data/output/mimiciv_demo_croissant_rai.jsonld: reference output showing the resulting Croissant JSON-LD
Changes
- src/croissant_maker/rai/: new module with schema (Pydantic models), YAML loader, and injector
- src/croissant_maker/__main__.py: adds --rai-config flag to the generate command and the rai-apply subcommand
- tests/test_rai.py: integration test that runs the full pipeline against the MIMIC-IV demo and validates output against the reference file
Broader discussion