1 change: 1 addition & 0 deletions .gitleaksignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
3340688aefb8708b0d945abccaeb49e44932a64d:src/tutorials/manage_dataset_with_shell.md:curl-auth-header:122
11 changes: 6 additions & 5 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -110,11 +110,8 @@ nav:
- Metadata:
- share/metadata/index.md
- share/metadata/configuration.md
- share/metadata/reference.md
- Examples:
- share/metadata/model-case.md
- share/metadata/dataset-case.md
- share/metadata/stac-concepts.md
- share/metadata/reference.md
- Data:
- share/data/index.md
- Git LFS: share/data/git-lfs.md
Expand All @@ -123,8 +120,12 @@ nav:
- share/ml-tracking/index.md
- GitLab Integration: share/ml-tracking/gitlab.md
- SharingHub Integration: share/ml-tracking/mlflow-sharinghub.md
- Examples:
- share/examples/model-case.md
- share/examples/dataset-case.md
- Tutorials:
- tutorials/manage_dataset_with_dvc.md
- tutorials/dataset_with_dvc.md
- tutorials/dataset_with_huggingface_interface.md
- Resources:
- resources/index.md
- resources/stac.md
Expand Down
2 changes: 1 addition & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
mkdocs~=1.5
mkdocs-macros-plugin~=1.0
mkdocs-material~=9.5
mkdocs-swagger-ui-tag==0.6.10
mkdocs-swagger-ui-tag==0.6.11
2 changes: 1 addition & 1 deletion src/explore/project-view.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ The STAC helper display download code with the STAC API by using Python `request
### DVC

The DVC button displays configuration and setup of DVC for the project.
The DVC service enables you to store large volumes of data. The DVC button is therefore only displayed for projects in the "Datasets" category, which have a DVC configuration in the source GitLab project. This button displays the remote DVC configuration link and additional information on the DVC [documentation](https://dvc.org/doc) and [tutorial](../tutorials/manage_dataset_with_dvc.md).
The DVC service enables you to store large volumes of data. The DVC button is therefore only displayed for projects in the "Datasets" category, which have a DVC configuration in the source GitLab project. This button displays the remote DVC configuration link and additional information on the DVC [documentation](https://dvc.org/doc) and [tutorial](../tutorials/dataset_with_dvc.md).

![DVC helper](../assets/figures/explore/project-view/dvc-helper.png)

Expand Down
3 changes: 2 additions & 1 deletion src/share/data/dvc.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,8 @@

DVC (Data Version Control) is an open-source version control system specifically designed for machine learning (ML) and data science projects. It complements traditional version control systems like Git by focusing on managing the large datasets and machine learning models typically used in these domains. DVC provides a simple yet powerful set of tools to track changes, collaborate, and reproduce experiments with data and ML models efficiently. DVC allows you to version control datasets alongside your code. It stores lightweight metafiles in Git to track changes to datasets, while the actual data files are stored separately, typically in cloud storage or network-attached storage (NAS).

Your deployment of SharingHub may support DVC, and offer a DVC remote with an S3 storage behind. If a project uses DVC, you can see the [DVC button](../../explore/project-view.md#dvc) being displayed. To learn more about the usage of DVC, and our store remote, checkout the dedicated [tutorial](../../tutorials/manage_dataset_with_dvc.md).
Your deployment of SharingHub may support DVC, and offer a DVC remote with an S3 storage behind. If a project uses DVC, you can see the [DVC button](../../explore/project-view.md#dvc) being displayed.

- [:book: Tutorial "Dataset with DVC"](../../tutorials/dataset_with_dvc.md)
- [🔗 Official website](https://dvc.org/)
- [🔗 Documentation](https://dvc.org/doc)
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ To share your dataset on the SharingHub, you need to set up your GitLab reposito
To make your dataset usable by others, you need to create a `README.md` file. This file should begin with a YAML section describing your dataset's metadata, followed by a markdown section:

- The markdown part of your README must contain all useful information about the dataset: how to use it and in what context, how it was created etc...
- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](./reference.md)].
- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](../metadata/reference.md)].

## Structure

Expand All @@ -35,7 +35,7 @@ The repository tree:
└── image.png
```

You may notice the "dvc" extensions, this is because we use [DVC](../data/dvc.md) to store the files. Learn more in the tutorial ["Manage large dataset with DVC"](../../tutorials/manage_dataset_with_dvc.md).
You may notice the "dvc" extensions, this is because we use [DVC](../data/dvc.md) to store the files. Learn more in the tutorial ["Dataset with DVC"](../../tutorials/dataset_with_dvc.md).

## Metadata

Expand Down Expand Up @@ -63,6 +63,6 @@ label:

Let’s break down the project's metadata.

- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](./reference.md#assets)]
- `gsd`: pure STAC property. [[Ref](./reference.md#remaining-properties)]
- `label`: a [STAC extension](https://github.com/stac-extensions/label), adapted to the dataset use-case. [[Ref](./reference.md#extensions)]
- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](../metadata/reference.md#assets)]
- `gsd`: pure STAC property. [[Ref](../metadata/reference.md#remaining-properties)]
- `label`: a [STAC extension](https://github.com/stac-extensions/label), adapted to the dataset use-case. [[Ref](../metadata/reference.md#extensions)]
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ To share your artificial intelligence model on the SharingHub, you need to set u
To make your model usable by others, you need to create a `README.md` file. This file should begin with a YAML section describing your model's metadata, followed by a markdown section:

- The markdown part of your README must contain all the elements needed to train and/or make an inference with your AI model!
- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](./reference.md)].
- The YAML section is delimited by three `---` at the top of your file and at the end of the section. It contains the metadata presented in the \[[Reference](../metadata/reference.md)].

## Structure

Expand Down Expand Up @@ -75,8 +75,8 @@ ml-model:

Let’s break down the project's metadata.

- `title`: override the default title, which is the name of the GitLab project "Unet Building Footprint Segmentation Aerial Image". [[Ref](./reference.md#title)]
- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](./reference.md#assets)]
- `gsd` and `platform` are pure STAC properties. [[Ref](./reference.md#remaining-properties)]
- `providers`: override the default providers. [[Ref](./reference.md#providers)]
- `label` and `ml-model` are STAC extensions. Our model here uses the adapted STAC extensions with its use-case, [label](https://github.com/stac-extensions/label) and [ml-model](https://github.com/stac-extensions/ml-model). [[Ref](./reference.md#extensions)]
- `title`: override the default title, which is the name of the GitLab project "Unet Building Footprint Segmentation Aerial Image". [[Ref](../metadata/reference.md#title)]
- `assets`: define the files in the repository that we want to share with SharingHub. [[Ref](../metadata/reference.md#assets)]
- `gsd` and `platform` are pure STAC properties. [[Ref](../metadata/reference.md#remaining-properties)]
- `providers`: override the default providers. [[Ref](../metadata/reference.md#providers)]
- `label` and `ml-model` are STAC extensions. Our model here uses the adapted STAC extensions with its use-case, [label](https://github.com/stac-extensions/label) and [ml-model](https://github.com/stac-extensions/ml-model). [[Ref](../metadata/reference.md#extensions)]
222 changes: 222 additions & 0 deletions src/tutorials/dataset_with_dvc.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
# Dataset with DVC

## Introduction

The most popular version control system, [Git](https://git-scm.com/), is not suited to versioning
large amounts of data, so committing big files to your repository is not recommended.

When we need to version large amounts of data, we use Data Version Control ([DVC](https://dvc.org/)),
which lets you capture the versions of files and directories in your Git repository while storing them
on-premises or in cloud storage. Each DVC "commit" updates DVC-specific files, and these
modifications can be committed with Git. The real data is versioned and stored with DVC,
while your Git repository holds only a "pointer" to this data. The result is a single history
for your source code and data that you can traverse: a proper journal of your work!

The DVC integration offered by SharingHub enables protected access to versioned data,
while respecting the management of data access rights carried out on GitLab,
making it the central point for information management.

## Prerequisites

### Install

First, you must of course install DVC.

Follow their documentation: [Installation](https://dvc.org/doc/install)

### Git repository

When you have access to the `dvc` command, you will need to use it in a Git repository.
You can create one for this tutorial, or use an existing one.

Initialize a repository for the tutorial:

```bash
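# --initial-branch requires Git 2.28+; it keeps the branch name consistent with the push to "main"
git init --initial-branch=main example-dvc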
cd example-dvc
touch README.md
git add README.md
git commit -m "Initial commit"
# replace with your own GitLab project url
git remote add origin https://gitlab.example.com/<project-path>.git
git push --set-upstream origin main
```

### GitLab project (Optional)

If you want to use the SharingHub integration with DVC, you will need to push your
repository to GitLab. As described in the [dataset example](../share/examples/dataset-case.md#configuration),
you must also add the topic `sharinghub:dataset` to the project.

## Setup DVC

### Init

The first step is to initialize the DVC configuration.

```bash
dvc init
```

The configuration itself is not ignored by Git, as you need to share it with other users.
The authentication credentials, on the other hand, are ignored by Git for security reasons.

### Configure remote

You will now need to configure a remote storage and the appropriate authentication.

#### SharingHub

You can use SharingHub as the remote storage for DVC. On your project page you can find
a [code generator](../explore/project-view.md#dvc) to help you with the setup.

<figure markdown>
![DVC helper](../assets/figures/explore/project-view/dvc-helper.png)
<figcaption>DVC helper</figcaption>
</figure>

Copy the project's unique identifier (`<project_id>`) from the GitLab project page,
available at the URL `https://gitlab.example.com/<project-path>`.

![gitlab_project_id.png](../assets/figures/tutorials/gitlab-project-id.png)

This ID identifies the storage path for DVC. With it, you can continue the configuration.

```bash
# replace with your sharinghub URL and the correct project ID
dvc remote add --default sharinghub https://sharinghub.example.com/api/store/<project_id>
dvc remote modify sharinghub auth custom
dvc remote modify sharinghub custom_auth_header 'X-Gitlab-Token'
```
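The remote URL above follows the pattern `<sharinghub-url>/api/store/<project_id>`. As a minimal sketch, the helper below assembles that URL and rejects a non-numeric ID before it reaches `dvc remote add`. The function name `sharinghub_store_url` is hypothetical (it is not part of DVC or SharingHub), and the check assumes that the GitLab project ID is a positive integer.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of DVC or SharingHub): build the DVC remote URL
# from a SharingHub base URL and a GitLab project ID.
sharinghub_store_url() {
  local base_url="${1%/}"   # drop one trailing slash, if any
  local project_id="$2"

  # Assumption: GitLab project IDs are positive integers.
  case "$project_id" in
    ''|*[!0-9]*)
      echo "error: project ID must be numeric, got '$project_id'" >&2
      return 1
      ;;
  esac

  printf '%s/api/store/%s\n' "$base_url" "$project_id"
}
```

It could be used as `url="$(sharinghub_store_url https://sharinghub.example.com <project_id>)" && dvc remote add --default sharinghub "$url"`, so that the remote is only added when the URL was built successfully.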

You can commit and push the DVC configuration:

```bash
git add .
git commit -m "Initialize DVC"
git push origin main
```

Finally, configure your authentication credentials with a GitLab access token. The token needs
at least the `read_api` permission.

```bash
dvc remote modify --local sharinghub password <your-personal-gitlab-token>
```
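Typing the token directly on the command line may leave it in your shell history. One possible way around this, sketched below, is to pass it through an environment variable and fail early when the variable is empty. The function name `set_sharinghub_token` and the variable name `GITLAB_TOKEN` are assumptions made for this example; the remote name `sharinghub` is the one configured above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: take the token from an environment variable so that it
# is not typed literally on the command line. GITLAB_TOKEN is an assumed name.
set_sharinghub_token() {
  if [ -z "${GITLAB_TOKEN:-}" ]; then
    echo "error: GITLAB_TOKEN is empty or unset" >&2
    return 1
  fi
  # --local writes to .dvc/config.local, which DVC keeps out of Git.
  dvc remote modify --local sharinghub password "$GITLAB_TOKEN"
}
```

In an interactive Bash session, `read -r -s -p "GitLab token: " GITLAB_TOKEN` prompts for the value without echoing it; calling `set_sharinghub_token` afterwards stores it in the local DVC configuration.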

#### S3 bucket

!!! warning
    Using a custom DVC remote, such as your own S3 bucket, makes your project harder to share.
    Access to that bucket requires the bucket's own credentials, which are not tied to
    our "GitLab-centered" philosophy. Be sure to address this
    by properly documenting how to retrieve the credentials.

Alternatively, you can choose to use an S3 bucket for storage. To keep the bucket usable
for other repositories, use a subpath in the bucket path.
It could be the project ID, path, name, slug, etc.

```bash
dvc remote add --default my-bucket s3://<bucket>/<project-identifier>
dvc remote modify my-bucket endpointurl <s3-endpoint-url>
```

Now commit and push the DVC configuration:

```bash
git add .
git commit -m "Initialize DVC"
git push origin main
```

Configure the remote credentials with your S3 access key id and secret access key.

```bash
dvc remote modify --local my-bucket access_key_id <access-key-id>
dvc remote modify --local my-bucket secret_access_key <secret-access-key>
```

## Usage

### Tracking data

Using DVC is simple, but because it is used alongside Git, you must be rigorous and
remember to run both tools at each step.

Let's pick a piece of data to work with. We'll create a file, `very_big_file.txt`, in the `data` directory.

```bash
mkdir data
echo "Very big content" > data/very_big_file.txt
```

Use `dvc add` to start tracking the dataset:

```bash
dvc add data
```

DVC stores information about the added data in a special `.dvc` file named `data.dvc`. This small, human-readable metadata file acts as a placeholder for the original data for the purpose of Git tracking. You can track files or directories.
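To make the placeholder idea concrete, here is a conceptual sketch in plain shell: it hashes a file, copies the content into a hash-addressed store, and writes a small pointer file that Git could track. This illustrates the principle only and is not DVC's implementation; the function name `track_file`, the `.store` directory, and the `.ptr` suffix are made up for the example, and it assumes `md5sum` is available.

```bash
#!/usr/bin/env bash
# Conceptual sketch only: mimics the "pointer file" idea, not DVC itself.
# track_file, .store and the .ptr suffix are hypothetical names.
track_file() {
  local path="$1"
  local store="${2:-.store}"
  local digest

  [ -f "$path" ] || { echo "error: no such file: $path" >&2; return 1; }

  # The content hash identifies this exact version of the data.
  digest="$(md5sum "$path" | cut -d' ' -f1)"

  # Keep the real content outside Git, addressed by its hash.
  mkdir -p "$store"
  cp "$path" "$store/$digest"

  # The small pointer file is what would be committed to Git.
  printf 'md5: %s\npath: %s\n' "$digest" "$path" > "$path.ptr"
}
```

Calling `track_file` again after modifying the file produces a new hash and a new entry in the store, while the previous content stays available under its old hash. This is why committing only the pointer is enough to get back to an earlier version of the data.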

Next, run the following commands to track changes in Git:

```bash
git add data.dvc .gitignore
git commit -m "Add data"
dvc push
git push
```

Now, we can modify the data.

```bash
echo "Very big content but different" > data/very_big_file.txt
```

Update DVC tracking:

```bash
dvc add data
```

You will notice that `data.dvc` was modified to reflect that the data changed.
To finalize the update, commit and push.

```bash
git add data.dvc
git commit -m "Modify data"
dvc push
git push
```

Because Git and DVC work together, when you go back to the previous commit you can synchronize
the data to its previous version.

```bash
git checkout HEAD~1
dvc checkout
```
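Since the two commands always go together, they can be wrapped in a small helper so that code and data never end up at different revisions. This is only a sketch; the function name `checkout_with_data` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: move code and data to the same revision in one step.
checkout_with_data() {
  local rev="$1"
  if [ -z "$rev" ]; then
    echo "usage: checkout_with_data <git-revision>" >&2
    return 2
  fi
  # Git restores data.dvc first; dvc checkout then syncs the workspace to it.
  git checkout "$rev" && dvc checkout
}
```

Note that `git checkout HEAD~1` leaves Git in a detached HEAD state, so once you are done inspecting the older version, `checkout_with_data main` returns both code and data to the branch tip. `dvc checkout` restores files from the local cache; if the requested version is not cached locally, run `dvc pull` to fetch it from the remote.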

### Retrieving data

To retrieve the managed data:

* Clone the Git project.

```bash
git clone https://gitlab.example.com/<project-path>
cd <project-dir>
```

* Configure authentication credentials as described in the section [Configure remote](#configure-remote).

```bash
# Credentials setup for SharingHub integration
dvc remote modify --local sharinghub password <your-personal-gitlab-token>
```

* Download the data through `dvc pull`.

```bash
dvc pull
```