diff --git a/.gitleaksignore b/.gitleaksignore new file mode 100644 index 0000000..5073416 --- /dev/null +++ b/.gitleaksignore @@ -0,0 +1 @@ +3340688aefb8708b0d945abccaeb49e44932a64d:src/tutorials/manage_dataset_with_shell.md:curl-auth-header:122 diff --git a/mkdocs.yml b/mkdocs.yml index 2e2a686..1253839 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -110,11 +110,8 @@ nav: - Metadata: - share/metadata/index.md - share/metadata/configuration.md - - share/metadata/reference.md - - Examples: - - share/metadata/model-case.md - - share/metadata/dataset-case.md - share/metadata/stac-concepts.md + - share/metadata/reference.md - Data: - share/data/index.md - Git LFS: share/data/git-lfs.md @@ -123,8 +120,12 @@ nav: - share/ml-tracking/index.md - GitLab Integration: share/ml-tracking/gitlab.md - SharingHub Integration: share/ml-tracking/mlflow-sharinghub.md + - Examples: + - share/examples/model-case.md + - share/examples/dataset-case.md - Tutorials: - - tutorials/manage_dataset_with_dvc.md + - tutorials/dataset_with_dvc.md + - tutorials/dataset_with_huggingface_interface.md - Resources: - resources/index.md - resources/stac.md diff --git a/requirements.txt b/requirements.txt index 1512d45..1f4e555 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,4 @@ mkdocs~=1.5 mkdocs-macros-plugin~=1.0 mkdocs-material~=9.5 -mkdocs-swagger-ui-tag==0.6.10 +mkdocs-swagger-ui-tag==0.6.11 diff --git a/src/explore/project-view.md b/src/explore/project-view.md index f16f2a4..6e2b26a 100644 --- a/src/explore/project-view.md +++ b/src/explore/project-view.md @@ -29,7 +29,7 @@ The STAC helper display download code with the STAC API by using Python `request ### DVC The DVC button displays configuration and setup of DVC for the project. -The DVC service enables you to store large volumes of data. The DVC button is therefore only displayed for projects in the "Datasets" category, which have a DVC configuration in the source GitLab project. 
This button displays the remote DVC configuration link and additional information on the DVC [documentation](https://dvc.org/doc) and [tutorial](../tutorials/manage_dataset_with_dvc.md). +The DVC service enables you to store large volumes of data. The DVC button is therefore only displayed for projects in the "Datasets" category, which have a DVC configuration in the source GitLab project. This button displays the remote DVC configuration link and additional information on the DVC [documentation](https://dvc.org/doc) and [tutorial](../tutorials/dataset_with_dvc.md). ![DVC helper](../assets/figures/explore/project-view/dvc-helper.png) diff --git a/src/share/data/dvc.md b/src/share/data/dvc.md index 288ea5d..7d3b0a8 100644 --- a/src/share/data/dvc.md +++ b/src/share/data/dvc.md @@ -2,7 +2,8 @@ DVC (Data Version Control) is an open-source version control system specifically designed for machine learning (ML) and data science projects. It complements traditional version control systems like Git by focusing on managing the large datasets and machine learning models typically used in these domains. DVC provides a simple yet powerful set of tools to track changes, collaborate, and reproduce experiments with data and ML models efficiently. DVC allows you to version control datasets alongside your code. It stores lightweight metafiles in Git to track changes to datasets, while the actual data files are stored separately, typically in cloud storage or network-attached storage (NAS). -Your deployment of SharingHub may support DVC, and offer a DVC remote with an S3 storage behind. If a project uses DVC, you can see the [DVC button](../../explore/project-view.md#dvc) being displayed. To learn more about the usage of DVC, and our store remote, checkout the dedicated [tutorial](../../tutorials/manage_dataset_with_dvc.md). +Your deployment of SharingHub may support DVC, and offer a DVC remote with an S3 storage behind. 
If a project uses DVC, the [DVC button](../../explore/project-view.md#dvc) is displayed.
+- [:book: Tutorial "Dataset with DVC"](../../tutorials/dataset_with_dvc.md)
 - [🔗 Official website](https://dvc.org/)
 - [🔗 Documentation](https://dvc.org/doc)
diff --git a/src/share/metadata/dataset-case.md b/src/share/examples/dataset-case.md
similarity index 84%
rename from src/share/metadata/dataset-case.md
rename to src/share/examples/dataset-case.md
index e8b5bd5..f9d00b0 100644
--- a/src/share/metadata/dataset-case.md
+++ b/src/share/examples/dataset-case.md
@@ -13,7 +13,7 @@ To share your dataset on the SharingHub, you need to set up your GitLab reposito
 To make your dataset usable by others, you need to create a `README.md` file. This file should begin with a YAML section describing your dataset's metadata, followed by a markdown section:
 
 - The markdown part of your README must contain all useful information about the dataset: how to use it and in what context, how it was created etc...
-- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](./reference.md)].
+- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](../metadata/reference.md)].
 
 ## Structure
 
@@ -35,7 +35,7 @@ The repository tree:
 └── image.png
 ```
 
-You may notice the "dvc" extensions, this is because we use [DVC](../data/dvc.md) to store the files. Learn more in the tutorial ["Manage large dataset with DVC"](../../tutorials/manage_dataset_with_dvc.md).
+You may notice the "dvc" extensions; this is because we use [DVC](../data/dvc.md) to store the files. Learn more in the tutorial ["Dataset with DVC"](../../tutorials/dataset_with_dvc.md).
 
 ## Metadata
 
@@ -63,6 +63,6 @@ label:
 Let's break down the project's metadata.
-- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](./reference.md#assets)]
-- `gsd`: pure STAC property. [[Ref](./reference.md#remaining-properties)]
-- `label`: a [STAC extension](https://github.com/stac-extensions/label), adapted to the dataset use-case. [[Ref](./reference.md#extensions)]
+- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](../metadata/reference.md#assets)]
+- `gsd`: pure STAC property. [[Ref](../metadata/reference.md#remaining-properties)]
+- `label`: a [STAC extension](https://github.com/stac-extensions/label), adapted to the dataset use-case. [[Ref](../metadata/reference.md#extensions)]
diff --git a/src/share/metadata/model-case.md b/src/share/examples/model-case.md
similarity index 84%
rename from src/share/metadata/model-case.md
rename to src/share/examples/model-case.md
index 8cdd957..aad4ee1 100644
--- a/src/share/metadata/model-case.md
+++ b/src/share/examples/model-case.md
@@ -13,7 +13,7 @@ To share your artificial intelligence model on the SharingHub, you need to set u
 To make your model usable by others, you need to create a `README.md` file. This file should begin with a YAML section describing your model's metadata, followed by a markdown section:
 
 - The markdown part of your README must contain all the elements needed to train and/or make an inference with your AI model!
-- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](./reference.md)].
+- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](../metadata/reference.md)].
 
 ## Structure
 
@@ -75,8 +75,8 @@ ml-model:
 Let's break down the project's metadata.
 
-- `title`: override the default title, which is the name of the GitLab project "Unet Building Footprint Segmentation Aerial Image". [[Ref](./reference.md#title)]
-- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](./reference.md#assets)]
-- `gsd` and `platform` are pure STAC properties. [[Ref](./reference.md#remaining-properties)]
-- `providers`: override the default providers. [[Ref](./reference.md#providers)]
-- `label` and `ml-model` are STAC extensions. Our model here uses the adapted STAC extensions with its use-case, [label](https://github.com/stac-extensions/label) and [ml-model](https://github.com/stac-extensions/ml-model). [[Ref](./reference.md#extensions)]
+- `title`: override the default title, which is the name of the GitLab project "Unet Building Footprint Segmentation Aerial Image". [[Ref](../metadata/reference.md#title)]
+- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](../metadata/reference.md#assets)]
+- `gsd` and `platform` are pure STAC properties. [[Ref](../metadata/reference.md#remaining-properties)]
+- `providers`: override the default providers. [[Ref](../metadata/reference.md#providers)]
+- `label` and `ml-model` are STAC extensions. Our model here uses the STAC extensions adapted to its use-case, [label](https://github.com/stac-extensions/label) and [ml-model](https://github.com/stac-extensions/ml-model). [[Ref](../metadata/reference.md#extensions)]
diff --git a/src/tutorials/dataset_with_dvc.md b/src/tutorials/dataset_with_dvc.md
new file mode 100644
index 0000000..be8c746
--- /dev/null
+++ b/src/tutorials/dataset_with_dvc.md
@@ -0,0 +1,222 @@
+# Dataset with DVC
+
+## Introduction
+
+The most popular version control system, [Git](https://git-scm.com/), is not suited to versioning
+large amounts of data; it is not recommended to commit big files to your repository.
+
+When we need to version large amounts of data, we use Data Version Control ([DVC](https://dvc.org/)),
+which lets you capture the versions of files and directories in your Git repository, while storing them
+on-premises or in cloud storage. Each DVC "commit" updates DVC-specific files, and these
+modifications can be committed with Git. The real data is then versioned and stored with DVC,
+while your Git repository references the "pointer" to this data. The result is a single history
+for your source code and data that you can traverse: a proper journal of your work!
+
+The DVC integration offered by SharingHub enables protected access to versioned data,
+while respecting the management of data access rights carried out on GitLab,
+making it the central point for information management.
+
+## Prerequisites
+
+### Install
+
+First, you must of course install DVC.
+
+Follow their documentation: [Installation](https://dvc.org/doc/install)
+
+### Git repository
+
+Once you have access to the `dvc` command, you will need to use it in a Git repository.
+You can create one for this tutorial, or use an existing one.
+
+Initialize a repository for the tutorial:
+
+```bash
+git init example-dvc
+cd example-dvc
+touch README.md
+git add README.md
+git commit -m "Initial commit"
+# replace with your own GitLab project URL
+git remote add origin https://gitlab.example.com/<namespace>/<project>.git
+git push --set-upstream origin main
+```
+
+### GitLab project (Optional)
+
+If you want to use the SharingHub integration with DVC, you will need to push your
+repository to GitLab. As described [here](../share/examples/dataset-case.md#configuration),
+you must add the topic `sharinghub:dataset` to the project.
+
+## Setup DVC
+
+### Init
+
+The first step is to initialize the DVC configuration.
+
+```bash
+dvc init
+```
+
+The configuration itself is not ignored by Git, as you need to share it with other users.
+The authentication credentials, on the other hand, will be ignored for obvious security reasons.
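As an illustration of what gets shared versus what stays local: DVC writes its committed configuration as plain INI-style text in `.dvc/config`, while credentials land in the git-ignored `.dvc/config.local`. The stdlib-only sketch below parses a sample of such a config; the remote name, project ID, and URL are placeholders for this sketch, not a real endpoint.

```python
import configparser

# Example of what `dvc init` plus `dvc remote add` end up writing to
# `.dvc/config`. The remote name and URL below are placeholders.
SAMPLE_DVC_CONFIG = """\
[core]
    remote = sharinghub
['remote "sharinghub"']
    url = https://sharinghub.example.com/api/store/42
    auth = custom
    custom_auth_header = X-Gitlab-Token
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_DVC_CONFIG)

# [core] names the default remote; each remote gets its own section.
default_remote = parser["core"]["remote"]
remote_urls = {name: parser[name]["url"] for name in parser.sections() if 'remote "' in name}

print("default remote:", default_remote)
print("remote urls:", remote_urls)
```

Because this file is plain text with no secrets in it, it is safe and useful to commit it, which is exactly why DVC does not gitignore it.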
+
+### Configure remote
+
+You will now need to configure a remote storage and the appropriate authentication.
+
+#### SharingHub
+
+You can use SharingHub as the remote storage of DVC. On your project page you can find
+a [code generator](../explore/project-view.md#dvc) to help you with the setup.
+
+<figure markdown>
+  ![DVC helper](../assets/figures/explore/project-view/dvc-helper.png)
+  <figcaption>DVC helper</figcaption>
+</figure>
+
+Copy the project's unique identifier (`<project_id>`) by connecting to the GitLab interface
+via the URL `https://gitlab.example.com/<namespace>/<project>`.
+
+![gitlab_project_id.png](../assets/figures/tutorials/gitlab-project-id.png)
+
+This ID is necessary to identify the storage path for DVC, so you can now continue the configuration.
+
+```bash
+# replace with your SharingHub URL and the correct project ID
+dvc remote add --default sharinghub https://sharinghub.example.com/api/store/<project_id>
+dvc remote modify sharinghub auth custom
+dvc remote modify sharinghub custom_auth_header 'X-Gitlab-Token'
+```
+
+You can commit and push the DVC configuration:
+
+```bash
+git add .
+git commit -m "Initialize DVC"
+git push origin main
+```
+
+Finally, configure your authentication credentials with a GitLab access token. You will need
+at least the `read_api` permission.
+
+```bash
+dvc remote modify --local sharinghub password <token>
+```
+
+#### S3 bucket
+
+!!! warning
+    Usage of a custom DVC remote such as your own S3 bucket will make sharing your project
+    harder: access to that bucket will require the bucket's credentials, and it is not tied to
+    our "GitLab-centered" philosophy. Be sure to address this problem by properly documenting
+    how to retrieve the credentials.
+
+You can alternatively choose to use an S3 bucket for the storage. To be able to reuse this
+bucket for other repositories, use a subpath in the bucket path.
+It could be the project ID, path, name, slug, etc.
+
+```bash
+dvc remote add --default my-bucket s3://<bucket>/<subpath>
+dvc remote modify my-bucket endpointurl <endpoint_url>
+```
+
+Now commit and push the DVC configuration:
+
+```bash
+git add .
+git commit -m "Initialize DVC"
+git push origin main
+```
+
+Configure the remote credentials with your S3 access key ID and secret access key.
+
+```bash
+dvc remote modify --local my-bucket access_key_id <access_key_id>
+dvc remote modify --local my-bucket secret_access_key <secret_access_key>
+```
+
+## Usage
+
+### Tracking data
+
+The use of DVC is simple, but because it is used alongside Git you must always be rigorous and not
+forget to use it correctly.
+
+Let's pick a piece of data to work with. We'll create a file, `very_big_file.txt`, in the `data` directory.
+
+```bash
+mkdir data
+echo "Very big content" > data/very_big_file.txt
+```
+
+Use `dvc add` to start tracking the dataset:
+
+```bash
+dvc add data
+```
+
+DVC stores information about the added data in a special `.dvc` file named `data.dvc`. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking. You can track files or directories.
+
+Next, run the following commands to track changes in Git:
+
+```bash
+git add data.dvc .gitignore
+git commit -m "Add data"
+dvc push
+git push
+```
+
+Now, we can modify the data.
+
+```bash
+echo "Very big content but different" > data/very_big_file.txt
+```
+
+Update DVC tracking:
+
+```bash
+dvc add data
+```
+
+You will notice that `data.dvc` was modified to reflect that the data changed.
+To finalize the update, commit and push.
+
+```bash
+git add data.dvc
+git commit -m "Modify data"
+dvc push
+git push
+```
+
+By combining Git and DVC, if you go back to the previous commit you can synchronize the data
+back to the previous version.
+
+```bash
+git checkout HEAD~1
+dvc checkout
+```
+
+### Retrieving data
+
+To retrieve the managed data:
+
+* Clone the Git project.
+
+    ```bash
+    git clone https://gitlab.example.com/<namespace>/<project>.git
+    cd <project>
+    ```
+
+* Configure authentication credentials as described in the section [Configure remote](#configure-remote).
+
+    ```bash
+    # Credentials setup for the SharingHub integration
+    dvc remote modify --local sharinghub password <token>
+    ```
+
+* Download the data through `dvc pull`.
+
+    ```bash
+    dvc pull
+    ```
diff --git a/src/tutorials/dataset_with_huggingface_interface.md b/src/tutorials/dataset_with_huggingface_interface.md
new file mode 100644
index 0000000..e64f51a
--- /dev/null
+++ b/src/tutorials/dataset_with_huggingface_interface.md
@@ -0,0 +1,187 @@
+# Dataset with HuggingFace interface
+
+## Introduction
+
+In this tutorial you will learn how to create a `dataset.py` file defining a HuggingFace
+`datasets` interface.
+
+A dataset contains files, most likely divided into directories with a specific layout.
+The Hugging Face `datasets` library allows you to create and load custom datasets efficiently,
+by offering a standardized Python API to interact with the data,
+making it reusable and shareable.
+
+This guide explains how to define a `dataset.py` file to create a dataset compatible
+with `datasets.load_dataset()`.
+
+## Prerequisites
+
+Before proceeding, make sure you have `datasets` installed:
+
+```bash
+pip install datasets
+```
+
+## HuggingFace interface
+
+### Structure
+
+Your dataset will include the data itself and a dataset script named `dataset.py`.
+This script defines how to download, load, and process data.
+
+The structure of a `dataset.py` consists of:
+
+- Dataset configuration: a class that defines configurations, inheriting from `datasets.BuilderConfig`.
+- Dataset: the main dataset class, inheriting from `datasets.GeneratorBasedBuilder`.
+
+### Example `dataset.py`
+
+```python
+import os
+
+import datasets
+
+_DESCRIPTION = """A custom image dataset."""
+_HOMEPAGE = "https://example.com"
+_LICENSE = "MIT"
+_CITATION = """@article{your_citation, title={Your Dataset} }"""
+
+class CustomDatasetConfig(datasets.BuilderConfig):
+    def __init__(self, **kwargs):
+        super(CustomDatasetConfig, self).__init__(**kwargs)
+
+class CustomDataset(datasets.GeneratorBasedBuilder):
+    BUILDER_CONFIGS = [
+        CustomDatasetConfig(name="default", version=datasets.Version("1.0.0"), description="Default config")
+    ]
+
+    def _info(self):
+        """Defines the dataset's schema, including features (columns) and metadata."""
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=datasets.Features({
+                "image": datasets.Image(),
+                "label": datasets.ClassLabel(names=["cat", "dog"]),
+            }),
+            supervised_keys=("image", "label"),
+            homepage=_HOMEPAGE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        """Splits the dataset into predefined subsets (for example, train, validation, test)."""
+        data_dir = "path/to/your/data"
+        return [
+            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"data_dir": data_dir}),
+        ]
+
+    def _generate_examples(self, data_dir):
+        """Iterates over the dataset files and yields data samples in a structured format."""
+        for i, filename in enumerate(os.listdir(data_dir)):
+            label = "cat" if "cat" in filename else "dog"
+            yield i, {"image": os.path.join(data_dir, filename), "label": label}
+```
+
+### Usage
+
+The `datasets.load_dataset()` function interacts with `dataset.py` through a series of method
+calls that define how data is loaded, processed, and structured. Here's a step-by-step breakdown:
+
+1. **Detecting the Dataset Script:**
+    - When calling `load_dataset("path/to/dataset.py")`, the `datasets` library identifies and
+      imports the `CustomDataset` class defined in `dataset.py`.
+
+2.
**Calling `_info()`:** + - The `_info()` method provides metadata about the dataset, including: + - **Features:** Defines the dataset structure (for example, images and labels). + - **Supervised Keys:** Specifies input-output pairs for supervised learning. + - **Dataset Citation & Homepage:** Provides dataset references. + +3. **Calling `_split_generators()`:** + - This method is responsible for splitting the dataset into predefined subsets + (for example, `train`, `validation`, `test`). + - The `dl_manager` argument can be used to download and extract files if needed. + - Each split is returned as a `datasets.SplitGenerator`, which provides parameters + to `_generate_examples()`. + +4. **Calling `_generate_examples()`:** + - This method iterates over the dataset files and yields structured examples. + - The function returns data in a format that matches the schema defined in `_info()`. + - Example: + + ```python + yield i, {"image": os.path.join(data_dir, filename), "label": label} + ``` + + - Each yielded sample becomes an entry in the final dataset. + +5. **Using Streaming Mode:** + - If the dataset is large and does not fit in memory, `datasets.load_dataset()` can be used to enable streaming mode: + + ```python + dataset = datasets.load_dataset("path/to/dataset.py", split="train", streaming=True) + ``` + + - When streaming is enabled, `_generate_examples()` is called. Samples are fetched and processed on demand rather than all at once. 
- This is useful for large datasets stored remotely (e.g., in cloud storage).
+    - Using `trust_remote_code=True` with streaming: if the dataset script contains custom
+      processing logic (e.g., special decoding functions), you might need to enable
+      `trust_remote_code=True`:
+      `dataset = load_dataset("path/to/dataset.py", split="train", streaming=True, trust_remote_code=True)`
+
+#### Load dataset: example
+
+```python
+import datasets
+
+dataset = datasets.load_dataset("path/to/dataset.py", split="train", streaming=True, trust_remote_code=True)
+print(next(iter(dataset)))  # Access the first training example
+```
+
+Note that with `streaming=True` and a single `split`, `load_dataset()` returns an iterable dataset, so samples are accessed by iteration rather than by indexing.
+
+With this flow, `dataset.py` ensures that `datasets.load_dataset()` loads and structures data correctly, making it ready for model training and analysis. In short, `load_dataset()` detects the dataset script, then calls `_info()`, `_split_generators()`, and `_generate_examples()` in turn.
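The on-demand behavior described above is plain Python generator mechanics. Below is a stdlib-only sketch, with no `datasets` dependency and invented file names, of the `(key, example)` contract that `_generate_examples()` must follow, consumed lazily the way streaming mode does:

```python
import itertools

def generate_examples(filenames):
    """Mimic _generate_examples(): lazily yield (unique_key, example_dict) pairs."""
    for i, filename in enumerate(filenames):
        label = "cat" if "cat" in filename else "dog"
        yield i, {"image": filename, "label": label}

# Hypothetical directory listing; a real builder would get this from data_dir.
files = ["cat_001.png", "dog_001.png", "cat_002.png"]

stream = generate_examples(files)              # nothing is processed yet
first_two = list(itertools.islice(stream, 2))  # samples are pulled on demand
print(first_two)
```

Iterating pulls exactly as many samples as requested, which is why streaming mode can serve datasets that do not fit in memory.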
+
+#### Load dataset: Sen1Floods11-Dataset
+
+For [Sen1Floods11-Dataset](https://github.com/EOEPCA/Sen1Floods11-Dataset/):
+
+```python
+from datasets import load_dataset
+
+train_data = load_dataset(
+    "sen1floods11-dataset/sen1floods11_dataset.py",
+    split="train",
+    streaming=True,
+    trust_remote_code=True,
+    config_kwargs={
+        "no_cache": False,
+        "context": "sen1floods11-dataset/",
+    },
+)
+print(next(iter(train_data)))
+```
+
+### DVC integration
+
+It is possible to use DVC in the `dataset.py` file to pull your data:
+
+```python
+import datasets
+from datasets import BuilderConfig
+from dvc.api import DVCFileSystem
+
+class CustomBuilderConfig(BuilderConfig):
+    def __init__(self, version="1.0.0", description=None, **kwargs):
+        super().__init__(version=version, description=description)
+        config = kwargs.get("config_kwargs")
+        self.context = config["context"]
+
+class CustomDataset(datasets.GeneratorBasedBuilder):
+    BUILDER_CONFIG_CLASS = CustomBuilderConfig
+
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        self.fs = DVCFileSystem(self.config.context)
+
+    ...
+```
+
+You can instantiate a `DVCFileSystem` object so you can pull your data with the method `self.fs.read_bytes(dvc_path)`.
diff --git a/src/tutorials/manage_dataset_with_dvc.md b/src/tutorials/manage_dataset_with_dvc.md
deleted file mode 100644
index da402c5..0000000
--- a/src/tutorials/manage_dataset_with_dvc.md
+++ /dev/null
@@ -1,111 +0,0 @@
-# Manage large dataset with DVC
-
-## Introduction
-
-The management of large amounts of data is achieved through integration with DVC.
-Data Version Control (DVC) lets you capture the versions of your data and models in Git commits, while storing them on-premises or in cloud storage. The result is a single history for data, code, and ML models that you can traverse — a proper journal of your work!
- -The DVC integration offered by SharingHub enables protected, high-performance access to data, while respecting the management of data access rights carried out on GitLab, making it the central point for information management. - -## Get Started with DVC - -### Prerequisites - -As a prerequisite for using DVC, you must have a Git repository initialized: - -```bash -mkdir example-dvc -cd example-dvc -git init -git remote add origin https://a:@gitlab.example.com/ -``` - -!!! note - You don't have to initially create your project in GitLab before executing these commands; these commands and next bellow will do it for you. - -### Initializing DVC with SharingHub - -Inside your chosen directory, we will use our current working directory as a DVC project. Let's initialize it by running dvc init inside a Git project: - -```bash -dvc init -git commit -m "Initialize DVC" -``` - -The following command will push the modifications into GitLab, create the project if necessary, and initialize rights management in SharingHub. - -```bash -git push --set-upstream origin main -``` - -You can now retrieve the project's unique identifier (``) by connecting to the GitLab interface via the URL `https://gitlab.example.com/`. - -![gitlab_project_id.png](../assets/figures/tutorials/gitlab-project-id.png) - -This id is necessary to identify the storage path for DVC, so you can continue the configuration. - -```bash -dvc remote add -d shstore https://sharinghub.example.com/api/store/ -dvc remote modify shstore auth custom -dvc remote modify shstore custom_auth_header 'X-Gitlab-Token' -dvc remote default shstore -git push -``` - -### Authenticate DVC with SharingHub - -Configure your authentication (will be only stored locally) - -```bash -dvc remote modify --local shstore password -``` - -### Tracking data as usual - -Working inside an initialized project directory, let's pick a piece of data to work with. We'll use an example `very_big_file.txt` file, in the `data` directory. 
- -```bash -mkdir data -echo "very big content" > data/very_big_file.txt -``` - -Use `dvc add` to start tracking the dataset file: - -```bash -dvc add data/very_big_file.txt -``` - -DVC stores information about the added file in a special `.dvc` file named `data/very_big_file.txt.dvc`. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking. - -Next, run the following commands to track changes in Git: - -```bash -git add data/very_big_file.txt.dvc data/.gitignore -git commit -m "Add raw data" -dvc push -git push -``` - -You can also directly add a complete directory: - -```bash -git add data/directory -git commit -m "Add a directory" -dvc push -git push -``` - -### Retrieving data - -To retrieve the managed data: - -* clone the GIT project -* configure authentication as described in the section `Authenticate DVC with SharingHub`. -* and download the data through `dvc pull`. - -```bash -git clone https://gitlab.example.com/ -cd example-dvc -dvc remote modify --local shstore password -dvc pull -```