feat(storage): update storage docs #1931

Merged
merged 4 commits into from
Jan 9, 2024
Changes from 3 commits
124 changes: 61 additions & 63 deletions docs/en/5-Storage/AzureBlobStorage.md
@@ -1,104 +1,102 @@
# Overview
# Azure Blob Storage (Containers)

[Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction) is Microsoft's object storage solution for the cloud. Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.

Azure Blob Storage Containers are good at three things:

- Large amounts of data - Containers can be huge: way bigger than hard drives. And they are still fast.
- Accessible by multiple consumers at once - You can access the same data source from multiple Notebook Servers and pipelines at the same time without needing to duplicate the data.
- Sharing - Project namespaces can share a container. This is great for sharing data with people outside of your workspace.

# Setup
Azure Blob Storage Containers have the following advantages over Kubeflow Volumes (Disks):

1. **Capacity:** Containers can be huge: way bigger than hard drives. And they are still fast.
2. **Simultaneity:** You can access the same data source from multiple Notebook Servers and pipelines at the same time without needing to duplicate the data.
3. **Shareability:** Project namespaces can share a container. This is great for sharing data with people outside of your workspace.

<!-- prettier-ignore -->
!!! warning "Azure Blob Storage containers and buckets mount will be replacing the Minio Buckets and Minio storage mounts"
Users will be responsible for migrating data from Minio Buckets to the Azure Storage folders. For larger files, users may contact AAW for assistance.
!!! warning "Azure Blob Storage containers and buckets have replaced MinIO storage and buckets."
Users are responsible for migrating data from MinIO buckets to the Azure Storage folders; see [How to Migrate from MinIO to Azure Blob Storage](#how-to-migrate-from-minio-to-azure-blob-storage) for instructions. For larger files, users may [contact AAW for assistance](https://statcan-aaw.slack.com).

## Blob Container Mounted on a Notebook Server
## Setup

<!-- prettier-ignore -->
### Accessing Blob Container from JupyterLab

The Blob CSI volumes are persisted under `/home/jovyan/buckets` when creating a Notebook Server. Files under `~/buckets` are backed by Blob storage. All AAW notebooks will have the `~/buckets` mounted to the file-system, making data accessible from everywhere.
The Blob CSI volumes are persisted under `~/buckets` when creating a Notebook Server. Files under `~/buckets` are backed by Blob storage. All AAW notebooks will have the `~/buckets` mounted to the file-system, making data accessible from everywhere.

![Blob folders mounted as Jupyter Notebook directories](../images/container-mount.png)
These folders can be used like any other - you can copy files to/from using the file browser, write from Python/R, etc. The only difference is that the data is being stored in the Blob storage container rather than on a local disk (and is thus accessible wherever you can access your Kubeflow notebook).
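
As a minimal sketch of the point above, the shell function below copies a file into a mounted container with plain `cp` and checks that it arrived. The function name `save_to_bucket` and the example file name are hypothetical; only the `~/buckets` mount comes from this page.

```shell
# Hypothetical helper: copy a local file into a mounted Blob container directory.
# Usage: save_to_bucket <file> <bucket_dir>
save_to_bucket() {
  src="$1"
  bucket_dir="$2"

  # Refuse to write if the container is not mounted where we expect it.
  if [ ! -d "$bucket_dir" ]; then
    echo "bucket directory not found: $bucket_dir" >&2
    return 1
  fi

  # An ordinary copy; the mount takes care of storing the data in Blob storage.
  cp "$src" "$bucket_dir/" || return 1

  # Confirm the file is now visible under the mount.
  ls -l "$bucket_dir/$(basename "$src")"
}
```

For example, `save_to_bucket results.csv ~/buckets/aaw-unclassified` would place a (hypothetical) `results.csv` in the unclassified container.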

# Unclassified Notebook AAW folder mount
![Unclassified notebook folders mounted in Jupyter Notebook directories](../images/unclassified-mount.png)
![Blob folders mounted as directories](../images/container-mount.png)

# Protected-b Notebook AAW folder mount
![Protected-b notebooks mounted as Jupyter Notebook directories](../images/protectedb-mount.png)
#### Unclassified Containers

These folders can be used like any other - you can copy files to/from using the file browser, write from Python/R, etc. The only difference is that the data is being stored in the Blob storage container rather than on a local disk (and is thus accessible wherever you can access your Kubeflow notebook).
Unclassified blob storage containers will appear as follows in the `~/buckets` folder.

## How to Migrate from MinIO to Azure Blob Storage
![Unclassified notebook folders mounted as directories in JupyterLab](../images/unclassified-mount.png)

First, import the environmental variables stored in your secrets vault. You will either import from `minio-gateway` or `fdi-gateway` depending on where your data was ingested.
#### Protected B Containers

```
jovyan@rstudio-0:~$ source /vault/secrets/fdi-gateway-protected-b
```
Protected B blob storage containers will appear as follows in the `~/buckets` folder.

Then you create an alias to access your data.
![Protected B notebooks mounted as directories in JupyterLab](../images/protectedb-mount.png)

```
jovyan@rstudio-0:~$ mc alias set minio $MINIO_URL $MINIO_ACCESS_KEY $MINIO_SECRET_KEY
```
### Container Types

List the contents of your data folder with `mc ls`.
The following Blob containers are available. Accessing all Blob containers is the same. The difference between containers is the storage type behind them:

```
jovyan@rstudio-0:~$ mc ls minio
```
- **aaw-unclassified:** By default, use this one to store unclassified data.
- **aaw-protected-b:** Use this one to store sensitive, Protected B data.
- **aaw-unclassified-ro:** Read-only access to unclassified data from within a Protected B notebook, so users can view unclassified data there.
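
To see which of the containers listed above are present in a given notebook and whether they look writable, a rough check along these lines may help. The helper name `bucket_mode` is hypothetical, and the `-w` test only reflects what the filesystem reports to the shell, which may not match the container's actual access policy.

```shell
# Hypothetical helper: report whether a mounted container directory looks writable.
# Usage: bucket_mode <bucket_dir>
bucket_mode() {
  if [ ! -d "$1" ]; then
    echo "missing"
  elif [ -w "$1" ]; then
    echo "read-write"
  else
    echo "read-only"
  fi
}

# Check each AAW container named above under the ~/buckets mount.
for name in aaw-unclassified aaw-protected-b aaw-unclassified-ro; do
  printf '%s: %s\n' "$name" "$(bucket_mode "$HOME/buckets/$name")"
done
```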

Finally, copy your MinIO data into your Azure Blob Storage directory with `mc cp --recursive`.
### Accessing Internal Data

```
jovyan@rstudio-0:~$ mc cp —-recursive minio ~/buckets/aaw-unclassified
```
Internal data is accessed through the DAS common storage connection, which serves internal and external users who require access to unclassified or Protected B data. The following containers can be provisioned:

If you have protected-b data, you can copy your data into the protected-b bucket.
- **external-unclassified:** Unclassified and accessible by both StatCan and non-StatCan employees.
- **external-protected-b:** Protected B and accessible by both StatCan and non-StatCan employees.
- **internal-unclassified:** Unclassified and accessible by StatCan employees only.
- **internal-protected-b:** Protected B and accessible by StatCan employees only.

```
jovyan@rstudio-0:~$ mc cp —-recursive minio ~/buckets/aaw-protected-b
```
The above containers follow the same convention as the AAW containers in terms of data; however, there is a layer of isolation between StatCan employees and non-StatCan employees. Non-StatCan employees are only allowed in **external** containers, while StatCan employees can have access to any container.

AAW has an integration with the FAIR Data Infrastructure team that allows users to transfer unclassified and Protected B data to Azure Storage Accounts, thus allowing users to access this data from Notebook Servers.

<!-- prettier-ignore -->
Please reach out to the FAIR Data Infrastructure team if you have a use case for this data.

## Container Types
## Pricing

The following Blob containers are available:
<!-- prettier-ignore -->
!!! info "Pricing models are based on CPU and Memory usage"
Pricing is covered by KubeCost for user namespaces (in Kubeflow, at the bottom of the Notebooks tab).

Accessing all Blob containers is the same. The difference between containers is the storage type behind them:
In general, Blob Storage is much cheaper than [Azure Managed Disks](https://azure.microsoft.com/en-us/pricing/details/managed-disks/) and has better I/O than managed SSD.

- **aaw-unclassified:** By default, use this one. Stores unclassified data.
## The Azure Storage Explorer

- **aaw-protected-b:** Stores sensitive protected-b data.
Our friends over at the Collaborative Analytics Environment (CAE) have some documentation on accessing your Azure Blob Storage from your AVD using the [Azure Storage Explorer](https://statcan.github.io/cae-eac/en/AzureStorageExplorer/).

- **aaw-unclassified-ro:** This classification is protected-b but read-only access. This is so users can view unclassified data within a protected-b notebook.
## How to Migrate from MinIO to Azure Blob Storage

<!-- prettier-ignore -->
First, `source` the environmental variables stored in your secrets vault. You will either `source` from **minio-gateway** or **fdi-gateway** depending on where your data was ingested:

## Accessing Internal Data
```
source /vault/secrets/fdi-gateway-protected-b
```

<!-- prettier-ignore -->
Accessing internal data uses the DAS common storage connection which has use for internal and external users that require access to unclassified or protected-b data. The following containers can be provisioned:
Then you create an alias to access your data:

- **external-unclassified**
- **external-protected-b**
- **internal-unclassified**
- **internal-protected-b**
```
mc alias set minio $MINIO_URL $MINIO_ACCESS_KEY $MINIO_SECRET_KEY
```

They follow the same convention as the AAW containers above in terms of data, however there is a layer of isolation between StatCan employees and non-StatCan employees. Non-Statcan employees are only allowed in **external** containers, while StatCan employees can have access to any container.
List the contents of your data folder with `mc ls`:

AAW has an integration with the FAIR Data Infrastructure team that allows users to transfer unclassified and protected-b data to Azure Storage Accounts, thus allowing users to access this data from Notebook Servers.
```
mc ls minio
```

Please reach out to the FAIR Data Infrastructure team if you have a use case for this data.
Finally, copy your MinIO data into your Azure Blob Storage directory with `mc cp --recursive`:

## Pricing
```
mc cp --recursive minio ~/buckets/aaw-unclassified
```

<!-- prettier-ignore -->
!!! info "Pricing models are based on CPU and Memory usage"
Pricing is covered by KubeCost for user namespaces (In Kubeflow at the bottom of the Notebooks tab).
If you have Protected B data, you can copy your data into the Protected B bucket:

In general, Blob Storage is much cheaper than [Azure Manage Disks](https://azure.microsoft.com/en-us/pricing/details/managed-disks/) and has better I/O than managed SSD.
```
mc cp --recursive minio ~/buckets/aaw-protected-b
```
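
The migration steps above can be sketched as a single shell function. This is an illustration under assumptions, not an official AAW script: the function name `migrate_minio` and the `DRY_RUN` switch are hypothetical, while the `mc` commands and the variable names are the ones shown above.

```shell
# Sketch of the migration steps above as a single function (names are hypothetical).
# Usage: migrate_minio <secrets_file> <destination_dir>
# With DRY_RUN=1 the function only prints the steps it would run.
migrate_minio() {
  secrets_file="$1"
  dest_dir="$2"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the plan without touching the vault or MinIO; secrets are not expanded.
    echo "source $secrets_file"
    echo 'mc alias set minio $MINIO_URL $MINIO_ACCESS_KEY $MINIO_SECRET_KEY'
    echo "mc ls minio"
    echo "mc cp --recursive minio $dest_dir"
    return 0
  fi

  # Step 1: load MINIO_URL, MINIO_ACCESS_KEY and MINIO_SECRET_KEY from the vault file.
  if [ ! -r "$secrets_file" ]; then
    echo "cannot read secrets file: $secrets_file" >&2
    return 1
  fi
  . "$secrets_file" || return 1

  # Step 2: register the alias used by the remaining commands.
  mc alias set minio "$MINIO_URL" "$MINIO_ACCESS_KEY" "$MINIO_SECRET_KEY" || return 1

  # Step 3: list the source data as a sanity check before copying.
  mc ls minio || return 1

  # Step 4: copy everything into the mounted Blob container directory.
  mc cp --recursive minio "$dest_dir"
}
```

Running `DRY_RUN=1 migrate_minio /vault/secrets/fdi-gateway-protected-b ~/buckets/aaw-protected-b` first prints the four steps so the source and destination can be reviewed before any data is copied.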
36 changes: 16 additions & 20 deletions docs/en/5-Storage/Disks.md → docs/en/5-Storage/KubeflowVolumes.md
@@ -1,18 +1,18 @@
# Overview
# Kubeflow Volumes (Disks)

Disks are the familiar hard drive style file systems you're used to, provided to you from fast solid state drives (SSDs)!
Kubeflow Volumes are similar in concept to the hard disk drives you are used to on your Windows, Mac or Linux Desktop. Kubeflow Volumes are sometimes just called disks and are backed by fast solid state drives (SSDs) under the hood!

# Setup
## Setup

When creating your notebook server, you request disks by adding Data Volumes to your notebook server (pictured below; go to `Advanced Options`). They are automatically mounted at the directory (`Mount Point`) you choose, and serve as a simple and reliable way to preserve data attached to a Notebook Server.

![Adding an existing volume to a new notebook server](../images/kubeflow_existing_volume.png)

<!-- prettier-ignore -->
??? warning "You pay for all disks you own, whether they're attached to a Notebook Server or not"
As soon as you create a disk, you're [paying](#pricing) for it until it is [deleted](#deleting-disk-storage), even if it's original Notebook Server is deleted. See [Deleting Disk Storage](#deleting-disk-storage) for more info
!!! Warning "You pay for all disks you own, whether they're attached to a Notebook Server or not."
As soon as you create a disk, you're [paying](#pricing) for it until it is [deleted](#deleting-disk-storage), even if its original Notebook Server is deleted. See [Deleting Disk Storage](#deleting-disk-storage) for more info.

# Once you've got the basics ...
## Once you've got the basics...

When you delete your Notebook Server, your disks **are not deleted**. This lets you reuse that same disk (with all its contents) on a new Notebook Server later (as shown above with `Type = Existing` and the `Name` set to the volume you want to reuse). If you're done with the disk and its contents, [delete it](#deleting-disk-storage).

@@ -25,22 +25,18 @@ To see your disks, check the Notebook Volumes section of the Notebook Server pag
## Pricing

<!-- prettier-ignore -->
??? info "Pricing models are tentative and may change"
??? info "Pricing models are tentative and may change."
As of writing, pricing is covered by the platform for initial users. This guidance explains how things are expected to be priced in the future, but this may change.

When mounting a disk, you get an [Azure Managed Disk](https://azure.microsoft.com/en-us/pricing/details/managed-disks/). The **Premium SSD Managed Disks** pricing shows the cost per disk based on size. Note that you pay for the size of disk requested, not the amount of space you are currently using.
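
Because billing follows the requested size rather than the space in use, it can help to compare the two for a mounted disk. The helper below is a small sketch using the standard `df` utility; the name `volume_usage` is hypothetical, and the size reported by the filesystem is typically slightly smaller than the disk size requested.

```shell
# Hypothetical helper: print the size and used space (in KiB) of the filesystem
# that holds the given mount point.
# Usage: volume_usage <mount_point>
volume_usage() {
  # -P selects the portable one-line-per-filesystem format; -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { printf "size_kib=%s used_kib=%s\n", $2, $3 }'
}
```

Pass the `Mount Point` chosen when the volume was added. A disk that is mostly empty is a candidate for a smaller request next time.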

<!-- prettier-ignore -->
??? info "Tips to minimize costs"
As disks can be attached to a Notebook Server and reused, a typical usage pattern could be:

* At 9AM, create a Notebook Server (request 2CPU/8GB RAM and a 32GB attached
disk)
* Do work throughout the day, saving results to the attached disk
* At 5PM, shut down your Notebook Server to avoid paying for it overnight
* NOTE: The attached disk **is not destroyed** by this action
* At 9AM the next day, create a new Notebook Server and **attach your existing
disk**
* Continue your work...

This keeps all your work safe without paying for the computer when you're not using it
??? info "Tips to minimize costs."
You can minimize costs by suspending your notebook servers when not in use. A typical workflow may look like:

- Create a Notebook Server with the appropriate amount of storage allocated to Workspace and Data Volumes.
- Do work throughout the day, saving results to the Data or Workspace Volume, depending on your needs.
- At the end of the workday, suspend your Notebook Server to avoid paying for it overnight.
- At 9AM the next day, resume your Notebook Server and continue your work.
- **Tip:** You can migrate your Workspace or Data Volume to a new notebook server without losing data as the destruction of the notebook server does not affect any attached Workspace or Data Volume.

40 changes: 27 additions & 13 deletions docs/en/5-Storage/Overview.md
@@ -1,18 +1,32 @@
# Storage
# Storage Overview

The platform provides several types of storage:
Several storage options are available to AAW users for accessing and importing their data. Each type is described below. Separate documentation pages cover connecting to each type of storage.

- Disk (also called Volumes on the Notebook Server creation screen)
- Containers (Azure Blob Storage)
- Data Lakes (coming soon)
## Kubeflow Volumes (Disks)

Depending on your use case, either disk or bucket may be most suitable:
Kubeflow uses virtual disks called Volumes. You will encounter them on the Notebook Server creation screen. These disks come in two varieties, called Workspace Volumes and Data Volumes. Both Data and Workspace Volumes can be reused and mounted on different Notebook Servers, but not at the same time.

| Type | Simultaneous Users | Speed | Total size | Shareable with Other Users |
| ---------------------------------: | -----------------------------------------------------------------------: | ------------------------------------------------------: | ------------------------ | ------------------------- |
| **[Disk](Disks.md)** | One machine/notebook server at a time | Fastest (throughput and latency) | <=512GB total per drive | No |
| **[Container (via Azure Blob Storage)](AzureBlobStorage.md)** | Simultaneous access from many machines/notebook servers at the same time | Fast-ish (Fast download, modest upload, modest latency) | Infinite (within reason) | [Yes] |
### Workspace Volumes

<!-- prettier-ignore -->
??? info "If you're unsure which to choose, don't sweat it"
These are guidelines, not an exact science - pick what sounds best now and run with it. The best choice for a complicated usage is non-obvious and often takes hands-on experience, so just trying something will help. For most situations both options work well even if they're not perfect, and remember that data can always be copied later if you change your mind.
Workspace Volumes are similar in concept and function to the hard drive in your laptop: this is where all the software is stored, and it is the default storage device for all your files.

### Data Volumes

If you need more storage, a Data Volume may be necessary. This is conceptually similar to plugging a high-capacity USB hard drive into your PC.

## Azure Blob Storage (Containers)

If you need to collaborate with others, Azure Blob Storage (as provided by FDI) may be the best option for you and your data. See [Choosing your storage](#choosing-your-storage) for more information.

## Choosing your storage

Depending on your needs and those of your project, Kubeflow Volumes or Azure Blob Storage (or both) may be most suitable:

| Type | Simultaneity | Speed | Capacity | Shareability |
| :--- | :--- | :--- | --- | --- |
| **[Azure Blob Storage (Containers)](AzureBlobStorage.md)** | Simultaneous access from many Notebook Servers at the same time | Fast-ish (fast download, modest upload, modest latency) | Infinite (within reason) | Shareable |
| **[Kubeflow Volumes (Disks)](KubeflowVolumes.md)** | One Notebook Server at a time | Fastest (throughput and latency) | <=512 GB total per disk | Not shareable |

<!-- prettier-ignore -->
!!! Info "If you're unsure which to choose, don't sweat it!"
These are guidelines, not an exact science: pick what sounds best now and run with it. The best choice for a complicated use case is not obvious and often takes hands-on experience, so simply trying something will help. In most situations both options work well even if they are not perfect, and remember that data can always be copied later if you change your mind.
4 changes: 2 additions & 2 deletions docs/en/7-MLOps/Machine-Learning-Model-Cloud-Storage.md
@@ -35,9 +35,9 @@ Depending on your use case, either disk or bucket may be most suitable. Our [sto

### Disks
<center>
[![Disks](../images/Disks.PNG)](../5-Storage/Disks.md)
[![Disks](../images/Disks.PNG)](../5-Storage/KubeflowVolumes.md)
</center>
**[Disks](../5-Storage/Disks.md)** are added to your notebook server by adding Data Volumes.
**[Disks](../5-Storage/KubeflowVolumes.md)** are added to your notebook server by adding Data Volumes.

### Data Lakes (Coming Soon)

Binary file modified docs/en/images/container-mount.png
Binary file modified docs/en/images/protectedb-mount.png
Binary file modified docs/en/images/unclassified-mount.png