@casenave casenave commented Sep 26, 2025

Checklist

  • Typing enforced
  • Documentation updated
  • Changelog updated
  • Tests and Example updates
  • Coverage should be 100%

PR Summary

This PR modernizes the Hugging Face bridge by:

  • Supporting DatasetDict: leverage the native split mechanism for efficient partial dataset access.
  • Simplifying metadata handling: remove cumbersome tricks for setting problem_definition and infos in the dataset description; provide clear functions to load/save them via JSON and YAML.
  • Enabling multiple problem definitions per repo: allow the bridge to handle different problem definitions seamlessly.
  • I/O functions: add load/save utilities that work both with Hugging Face Hub repositories and the local filesystem.
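
As a rough illustration of the metadata and multi-definition points above, the sketch below stores the infos and one named problem definition as standalone JSON files in a local directory and reads them back. It uses only the standard library. The file layout (`infos.json`, `problem_definitions/<name>.json`) and the helper names `save_json` and `load_json` are assumptions made for this example, not the bridge's actual API or on-disk format; the values are an abridged copy of the Tensile2d output shown further down in this description.

```python
import json
import tempfile
from pathlib import Path


def save_json(path: Path, payload: dict) -> None:
    """Write a dict as JSON, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_json(path: Path) -> dict:
    """Read a JSON file back into a dict."""
    return json.loads(path.read_text(encoding="utf-8"))


# Abridged values from the Tensile2d example in this PR description.
infos = {
    "data_production": {"type": "simulation"},
    "legal": {"license": "CC-BY-SA", "owner": "Safran"},
}
task_1 = {
    "task": "regression",
    "input_scalars_names": ["P", "p1", "p2", "p3", "p4", "p5"],
    "output_fields_names": ["U1", "U2", "q", "sig11", "sig12", "sig22"],
}

with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)  # stands in for a local copy of a dataset repo (hypothetical layout)
    save_json(repo / "infos.json", infos)
    # One file per named problem definition, so several can coexist in one repo.
    save_json(repo / "problem_definitions" / "task_1.json", task_1)

    loaded_infos = load_json(repo / "infos.json")
    loaded_task = load_json(repo / "problem_definitions" / "task_1.json")
    available = sorted(p.stem for p in (repo / "problem_definitions").glob("*.json"))

print(loaded_infos == infos, loaded_task == task_1, available)
```

Keeping each problem definition in its own named file is one simple way to let a single repository carry several tasks without touching the dataset description.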

HF PLAID datasets conversion

Current PLAID datasets can be downloaded, converted and uploaded with:

import os

# Temporary (?) trick; set before the Hugging Face imports in case the flag is read at import time.
os.environ["HF_HUB_DISABLE_XET"] = "1"

import datasets
from plaid.bridges import huggingface_bridge

source_repo_id = "PLAID-datasets/Tensile2d"
target_repo_id = "fabiencasenave/Tensile2d_converted"

# Download the existing single-split dataset and convert it to a PLAID dataset.
hf_dataset = datasets.load_dataset(source_repo_id, split="all_samples")
dataset = huggingface_bridge.huggingface_dataset_to_plaid(hf_dataset, processes_number=12, verbose=True)

# Recover the problem definition and infos stored in the old dataset description.
pb_def = huggingface_bridge.huggingface_description_to_problem_definition(hf_dataset.description)
infos = huggingface_bridge.huggingface_description_to_infos(hf_dataset.description)

# Build a DatasetDict that contains only the main splits.
main_splits = ["train_500", "test", "OOD"]
hf_dataset_dict = huggingface_bridge.plaid_dataset_to_huggingface_datasetdict(dataset, pb_def, main_splits)

# Upload the data, the infos and the named problem definition separately.
huggingface_bridge.push_dataset_dict_to_hub(target_repo_id, hf_dataset_dict)
huggingface_bridge.push_dataset_infos_to_hub(target_repo_id, infos)
huggingface_bridge.push_problem_definition_to_hub(target_repo_id, "task_1", pb_def)

Results here

Then:

hf_dataset = huggingface_bridge.load_hf_dataset_from_hub(target_repo_id, split='train_500[:10]')
infos = huggingface_bridge.load_hf_infos_from_hub(target_repo_id)
pb_def = huggingface_bridge.load_hf_problem_definition_from_hub(target_repo_id, "task_1")

print(f"{hf_dataset = }")
print('--')
print(f"{infos = }")
print('--')
print(f"{pb_def = }")

gives:

hf_dataset = Dataset({
    features: ['sample'],
    num_rows: 10
})
--
infos = {'data_production': {'physics': '2D quasistatic non-linear structural mechanics, small deformations, plane strain', 'type': 'simulation'}, 'legal': {'license': 'CC-BY-SA', 'owner': 'Safran'}}
--
pb_def = ProblemDefinition(input_scalars_names=['P', 'p1', 'p2', 'p3', 'p4', 'p5'], output_scalars_names=['max_von_mises', 'max_q', 'max_U2_top', 'max_sig22_top'], output_fields_names=['U1', 'U2', 'q', 'sig11', 'sig12', 'sig22'], input_meshes_names=['/Base_2_2/Zone'], task='regression')

Splits

These modifications drop support for "subsplits": the bridge now relies only on the "main splits", i.e. those defined natively through the DatasetDict. As a consequence, the splits stored in problem_definition are ignored by the Hugging Face bridge.
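
The effect of this change can be sketched with plain dictionaries: only the names listed as main splits become entries of the resulting mapping, and any other split is left out. The helper `select_main_splits` and the index values, including the `train_8` subsplit, are hypothetical and only illustrate the idea; the bridge itself builds a `DatasetDict`.

```python
def select_main_splits(samples, split_indices, main_splits):
    """Return {split_name: [samples]} for the requested main splits only."""
    missing = [name for name in main_splits if name not in split_indices]
    if missing:
        raise KeyError(f"unknown split(s): {missing}")
    return {
        name: [samples[i] for i in split_indices[name]]
        for name in main_splits
    }


# Toy data: six samples and made-up split indices.
samples = [f"sample_{i}" for i in range(6)]
split_indices = {
    "train_500": [0, 1, 2],
    "train_8": [0, 1],  # hypothetical "subsplit"; not exported
    "test": [3, 4],
    "OOD": [5],
}

dataset_dict = select_main_splits(samples, split_indices, ["train_500", "test", "OOD"])
print(sorted(dataset_dict))
print(dataset_dict["train_500"][:2])  # partial access is taken from a main split
```

Partial access, such as the first ten training samples, is then expressed on a main split, as with `split='train_500[:10]'` in the loading example above.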

🔗 Related issues

Addresses tasks from

#160
#241
#219

@casenave casenave requested a review from a team as a code owner September 26, 2025 19:14
@casenave casenave marked this pull request as draft September 26, 2025 19:14

codecov bot commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 30.00000% with 252 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/plaid/utils/cgns_helper.py | 7.96% | 185 Missing ⚠️ |
| src/plaid/bridges/huggingface_bridge.py | 51.56% | 62 Missing ⚠️ |
| src/plaid/utils/base.py | 44.44% | 5 Missing ⚠️ |

@casenave casenave changed the title from "♻️ Modernize the Hugging Face bridge and the split mechanisms" to "♻️ Modernize the Hugging Face bridge" Sep 27, 2025
@casenave casenave mentioned this pull request Sep 27, 2025
18 tasks

What is the purpose of this file?
