Model training guide #1406

Open · wants to merge 33 commits into base: main

Commits (33)
c6c2a73
initial commit + tf code
BRV158 Jul 8, 2024
636c0e8
document folder renamed
BRV158 Jul 9, 2024
1dd678a
GPU node pool added
BRV158 Jul 11, 2024
9b73def
Merge branch 'GoogleCloudPlatform:main' into data-load-strategy
ganochenkodg Aug 6, 2024
6b6d726
some updates
ganochenkodg Aug 8, 2024
aa79450
quickfix
ganochenkodg Aug 8, 2024
54c1f31
updates
ganochenkodg Aug 8, 2024
b8a0279
add sa
ganochenkodg Aug 8, 2024
5cddc1f
update
ganochenkodg Aug 11, 2024
7a45966
update notebook
ganochenkodg Aug 14, 2024
33d6d89
updates
ganochenkodg Aug 15, 2024
bbac023
update the code
ganochenkodg Aug 20, 2024
6a637e6
updates
ganochenkodg Aug 20, 2024
7c81a9e
quickfix
ganochenkodg Aug 20, 2024
be00cb6
quickfix
ganochenkodg Aug 20, 2024
6d4e556
updates
ganochenkodg Aug 21, 2024
bc0c262
regional tags added
BRV158 Aug 21, 2024
1166228
notebook cells explanation
BRV158 Aug 21, 2024
f9fff75
update the notebook
ganochenkodg Aug 22, 2024
c151919
fix
BRV158 Aug 22, 2024
33b3c2d
update the job
ganochenkodg Aug 22, 2024
595a01a
update the notebook
ganochenkodg Aug 22, 2024
2e0d880
new volume
ganochenkodg Aug 23, 2024
191a46c
dawnloading logic added
BRV158 Aug 27, 2024
a52a22a
separate download jobs
BRV158 Aug 29, 2024
c9dc544
ram-job fix
BRV158 Sep 3, 2024
59206d8
update the notebook
ganochenkodg Sep 12, 2024
0053b0b
Merge branch 'main' into data-load-strategy
ganochenkodg Sep 12, 2024
e9e1caf
model training squence edited
BRV158 Sep 13, 2024
d0043d2
dataset jobs renaming
BRV158 Sep 13, 2024
92b93fd
update headers
ganochenkodg Sep 16, 2024
f9520d1
Merge branch 'main' into data-load-strategy
ganochenkodg Sep 16, 2024
b9ecfdc
updates
ganochenkodg Sep 17, 2024
8 changes: 8 additions & 0 deletions ai-ml/model-train/README.md
# Data backend options for model training jobs on GKE

These examples show the performance of different storage options for model training jobs on
[Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine).

Visit the [Google Cloud documentation](will be known after publishing)
to follow the tutorials.
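
The manifests added in this PR appear to be meant to be applied in their numbered order. A minimal sketch, not part of the README itself, assuming kubectl already targets the tutorial cluster and the <PROJECT_ID>/<CLUSTER_PREFIX> placeholders have been substituted:

# Sketch only: apply the volume definitions first, then the dataset download job.
# (cloudbuild.yaml in 01-volumes is a Cloud Build config, not a Kubernetes resource.)
kubectl apply -f ai-ml/model-train/manifests/01-volumes/bucket.yaml
kubectl apply -f ai-ml/model-train/manifests/01-volumes/volumes.yaml
kubectl apply -f ai-ml/model-train/manifests/02-dataset/bucket-job.yaml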

47 changes: 47 additions & 0 deletions ai-ml/model-train/manifests/01-volumes/bucket.yaml
Collaborator:
Suggest: Judging from the requirements listed here, we'll need a README.md file for the ai-ml/model-train folder. Could you please add a README.md file with a link to the cloud.google.com tutorial where these samples will be used?

Contributor Author:
Added, but the link is empty; we won't know it until the guide is published.

# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START gke_ai_ml_model_train_01_bucket]
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gcs-fuse-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 16Gi
  storageClassName: example-storage-class
  mountOptions:
    - implicit-dirs
  csi:
    driver: gcsfuse.csi.storage.gke.io
    volumeHandle: <PROJECT_ID>-<CLUSTER_PREFIX>-model-train
    volumeAttributes:
      fileCacheCapacity: 5Gi
      fileCacheForRangeRead: "true"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gcs-fuse-claim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 16Gi
  volumeName: gcs-fuse-pv
  storageClassName: example-storage-class
# [END gke_ai_ml_model_train_01_bucket]
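
A quick way to exercise this manifest, not shown in the PR: a sketch of applying the Cloud Storage FUSE volume and checking the static binding, assuming the placeholders have been substituted and the Cloud Storage FUSE CSI driver is enabled on the cluster.

# Sketch only: apply the PV/PVC pair and confirm the claim binds to gcs-fuse-pv.
kubectl apply -f ai-ml/model-train/manifests/01-volumes/bucket.yaml
kubectl get pv gcs-fuse-pv
kubectl get pvc gcs-fuse-claim   # STATUS should become Bound (static binding via volumeName)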
33 changes: 33 additions & 0 deletions ai-ml/model-train/manifests/01-volumes/cloudbuild.yaml
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START gke_ai_ml_model_train_01_cloudbuild]
steps:
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: /bin/bash
    args:
      - '-c'
      - |
        gcloud compute ssh --tunnel-through-iap --quiet cloudbuild@${_INSTANCE_NAME} --zone=${_ZONE} --command="\
          sudo mkdir -p /mnt/disks/ram-disk && \
          sudo mount -t tmpfs -o size=16g tmpfs /mnt/disks/ram-disk && \
          sudo mkfs.ext4 -F /dev/disk/by-id/google-local-ssd-block0 && \
          sudo mkdir -p /mnt/disks/ssd0 && \
          sudo mount /dev/disk/by-id/google-local-ssd-block0 /mnt/disks/ssd0 && \
          sudo mkdir -p /mnt/disks/ssd0/outputs && \
          sudo chmod -R 777 /mnt/disks/ssd0/outputs"
substitutions:
  _ZONE: us-central1-a
  _INSTANCE_NAME: model-train-vm
# [END gke_ai_ml_model_train_01_cloudbuild]
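
The PR doesn't show how this build config is invoked; one plausible invocation, assuming the model-train-vm instance already exists and the Cloud Build service account is allowed to SSH over IAP, would be:

# Sketch only: run the disk-preparation step via Cloud Build with explicit substitutions.
gcloud builds submit --no-source \
    --config=ai-ml/model-train/manifests/01-volumes/cloudbuild.yaml \
    --substitutions=_ZONE=us-central1-a,_INSTANCE_NAME=model-train-vm

Since the config already declares the same substitution defaults, the --substitutions flag here mainly documents intent.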
95 changes: 95 additions & 0 deletions ai-ml/model-train/manifests/01-volumes/volumes.yaml
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START gke_ai_ml_model_train_01_volumes]
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-ssd-pv
spec:
  capacity:
    storage: 16Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd0
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "node_pool"
              operator: "In"
              values:
                - "model-train-pool"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: local-ssd-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage
  volumeName: local-ssd-pv
  resources:
    requests:
      storage: 16Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ram-disk-pv
spec:
  capacity:
    storage: 16Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ram-disk
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: "node_pool"
              operator: "In"
              values:
                - "model-train-pool"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ram-disk-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-storage
  volumeName: ram-disk-pv
  resources:
    requests:
      storage: 16Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pd-ssd-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 16Gi
  storageClassName: premium-rwo
# [END gke_ai_ml_model_train_01_volumes]
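
As a sanity check that isn't part of the diff: a sketch of applying these volumes and watching them bind, assuming the Cloud Build step above has already prepared /mnt/disks/ssd0 and /mnt/disks/ram-disk on a node labeled node_pool=model-train-pool.

# Sketch only: create the static local PVs plus the dynamic premium-rwo claim.
kubectl apply -f ai-ml/model-train/manifests/01-volumes/volumes.yaml
kubectl get pv local-ssd-pv ram-disk-pv                       # both should report Bound
kubectl get pvc local-ssd-claim ram-disk-claim pd-ssd-claim   # pd-ssd-claim may stay Pending until a pod consumes it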
109 changes: 109 additions & 0 deletions ai-ml/model-train/manifests/02-dataset/bucket-job.yaml
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# [START gke_ai_ml_model_train_02_data_load_job]
apiVersion: v1
kind: ServiceAccount
metadata:
  name: bucket-access
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: download-script
data:
  download.sh: |-
    #!/usr/bin/bash -x
    apt-get update -y && \
    apt-get install -y --no-install-recommends \
      git git-lfs rsync
    git lfs install
    cd /tmp
    echo "Saving dataset into tmp..."
    time git clone --depth=1 "$DATASET_REPO"; echo "cloned"
    if [ "$UPLOAD_SSD" == "1" ]; then
      echo "Saving dataset into Local SSD..."
      time rsync --info=progress2 -a /tmp/dataset/dataset/ /local-ssd/dataset/
    fi
    if [ "$UPLOAD_RAM" == "1" ]; then
      echo "Saving dataset into Ram disk..."
      time rsync --info=progress2 -a /tmp/dataset/dataset/ /ram-disk/dataset/
    fi
    if [ "$UPLOAD_PD" == "1" ]; then
      echo "Saving dataset into Persistent disk..."
      time rsync --info=progress2 -a /tmp/dataset/dataset/ /pd-ssd/dataset/
    fi
    if [ "$UPLOAD_BUCKET" == "1" ]; then
      echo "Saving dataset into Bucket..."
      time gsutil -q -m cp -r /tmp/dataset/dataset/ gs://$BUCKET_NAME/
      echo "Dataset was successfully saved in all storages!"
    fi
---
apiVersion: batch/v1
kind: Job
metadata:
  name: bucket-dataset-downloader
  labels:
    app: bucket-dataset-downloader
spec:
  ttlSecondsAfterFinished: 120
  template:
    metadata:
      labels:
        app: bucket-dataset-downloader
    spec:
      restartPolicy: OnFailure
      serviceAccountName: bucket-access
      containers:
        - name: gcloud
          image: gcr.io/google.com/cloudsdktool/google-cloud-cli:slim
          resources:
            requests:
              cpu: "1"
              memory: "3Gi"
            limits:
              cpu: "2"
              memory: "3Gi"
          command:
            - /scripts/download.sh
          env:
            - name: UPLOAD_BUCKET
              value: "1"
            - name: BUCKET_NAME
              value: <PROJECT_ID>-<CLUSTER_PREFIX>-model-train
            - name: DATASET_REPO
              value: "https://huggingface.co/datasets/dganochenko/dataset"
            - name: TIMEFORMAT
              value: "%0lR"
          volumeMounts:
            - name: scripts-volume
              mountPath: "/scripts/"
              readOnly: true
      volumes:
        - name: scripts-volume
          configMap:
            defaultMode: 0700
            name: download-script
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "present"
          effect: NoSchedule
        - key: "app.stateful/component"
          operator: "Equal"
          value: "model-train"
          effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
# [END gke_ai_ml_model_train_02_data_load_job]
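
To close the loop, again not part of the PR: a hedged sketch of running the downloader and following it, after substituting the <PROJECT_ID>/<CLUSTER_PREFIX> placeholders in BUCKET_NAME.

# Sketch only: launch the job, stream the download script's output, and wait for completion.
kubectl apply -f ai-ml/model-train/manifests/02-dataset/bucket-job.yaml
kubectl logs -f job/bucket-dataset-downloader
kubectl wait --for=condition=complete --timeout=30m job/bucket-dataset-downloader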