Configuring the environment for running benchmark recipes on a GKE Cluster with A3 Ultra Node Pools

This guide outlines the steps to configure the environment required to run benchmark recipes on a Google Kubernetes Engine (GKE) cluster with A3 Ultra node pools.

Prerequisites

Before you begin, ensure you have completed the following:

  1. Created a Google Cloud project with billing enabled.

    a. To create a project, see Creating and managing projects.
    b. To enable billing, see Verify the billing status of your projects.

  2. Enabled the following APIs:

  3. Requested enough GPU quota. Each a3-ultragpu-8g machine has 8 H200 GPUs attached.

    a. To view quotas, see View the quotas for your project. In the Filter field, select Dimensions (e.g. location) and specify gpu_family:NVIDIA_H200.
    b. If you don't have enough quota, request a higher quota.
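Because each a3-ultragpu-8g machine has 8 H200 GPUs, the quota you need scales linearly with the number of nodes. The following is a minimal sketch of that arithmetic. The helper name `required_h200_gpus` and the node count of 4 are illustrative assumptions, not part of the recipes.

```shell
# Each a3-ultragpu-8g machine has 8 H200 GPUs attached.
GPUS_PER_NODE=8

# Hypothetical helper: print the GPU quota needed for a given number of nodes.
required_h200_gpus() {
  echo $(( $1 * GPUS_PER_NODE ))
}

# Example: a node pool with 4 nodes (illustrative value).
required_h200_gpus 4
```

Compare the printed number with the gpu_family:NVIDIA_H200 quota shown for the region where you plan to create the cluster.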

The environment

The environment comprises the following components:

  • Client workstation: used to prepare, submit, and monitor ML workloads.
  • Google Cloud Storage (GCS) Bucket: used for storing datasets and logs.
  • Artifact Registry: serves as a private container registry for storing and managing Docker images used in the deployment.
  • Google Kubernetes Engine (GKE) Cluster with A3 Ultra Node Pools: provides a managed Kubernetes environment to run benchmark recipes.

Set up the client workstation

You can use either Google Cloud Shell or a local machine as the client workstation.

Google Cloud Shell

We recommend using Google Cloud Shell as it comes with all necessary components pre-installed.

Local client

If you prefer to use your local machine, ensure that it has the following components installed:

  1. Google Cloud SDK. To install, see Install the gcloud CLI.
  2. kubectl. To install, see the Kubernetes documentation.
  3. Helm. To install, see the Helm documentation.
  4. Docker. To install, see the Docker documentation.
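To confirm that a local machine is ready, you can check that each tool is on your PATH. This is a minimal sketch: the helper name `missing_tools` is hypothetical, and the check only confirms that the commands exist, not that their versions are suitable.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
    fi
  done
}

# Check the four components listed above.
missing="$(missing_tools gcloud kubectl helm docker)"
if [ -n "$missing" ]; then
  # $missing is intentionally unquoted so the names print on one line.
  echo "Missing tools:" $missing
else
  echo "All required tools are installed."
fi
```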

Set up a Google Cloud Storage bucket

To create the bucket, run the following command:

gcloud storage buckets create gs://<BUCKET_NAME> --location=<BUCKET_LOCATION> --no-public-access-prevention

Replace the following:

  • BUCKET_NAME: the name of your bucket. The name must comply with the Cloud Storage bucket naming conventions.
  • BUCKET_LOCATION: the location of your bucket. The bucket must be located in the same region as the GKE cluster.
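Before you create the bucket, you may want to sanity-check the name. The sketch below covers only a subset of the Cloud Storage naming conventions (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; a letter or digit at each end; no "goog" prefix). It does not replace the full rules. The helper name `is_plausible_bucket_name` and the sample name are hypothetical.

```shell
# Hypothetical helper: succeed if the name passes a subset of the
# Cloud Storage bucket naming rules, fail otherwise.
is_plausible_bucket_name() {
  case "$1" in
    goog*) return 1 ;;  # names must not begin with the "goog" prefix
  esac
  # 3-63 characters, allowed characters only, letter or digit at both ends.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Example with an illustrative bucket name.
if is_plausible_bucket_name "my-benchmark-bucket"; then
  echo "name looks valid"
else
  echo "name breaks a naming rule"
fi
```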

Set up an Artifact Registry

  • If you use Cloud KMS for repository encryption, create your repository by following the Artifact Registry instructions for customer-managed encryption keys (CMEK).

  • If you don't use Cloud KMS, you can create your repository by using the following command:

      gcloud artifacts repositories create <REPOSITORY> \
          --repository-format=docker \
          --location=<LOCATION> \
          --description="<DESCRIPTION>"

    Replace the following:

    • REPOSITORY: the name of the repository. For each repository location in a project, repository names must be unique.
    • LOCATION: the regional or multi-regional location for the repository. You can omit this flag if you set a default region.
    • DESCRIPTION: a description of the repository. Don't include sensitive data because repository descriptions are not encrypted.
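Images in a Docker-format Artifact Registry repository are addressed as LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG. The following sketch assembles that path from its parts. The helper names `registry_host` and `image_uri`, and every value passed to them, are illustrative assumptions.

```shell
# Hypothetical helper: print the Docker registry host for a repository location.
registry_host() {
  echo "$1-docker.pkg.dev"
}

# Hypothetical helper: print the full image path.
# Arguments: LOCATION PROJECT REPOSITORY IMAGE TAG
image_uri() {
  echo "$(registry_host "$1")/$2/$3/$4:$5"
}

# Illustrative values only; substitute your own.
image_uri us-central1 my-project my-repo my-image latest
```

To push to that path from the client workstation, Docker first needs credentials for the registry host, which you can set up with `gcloud auth configure-docker <LOCATION>-docker.pkg.dev`.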

Create a GKE Cluster with A3 Ultra Node Pools

Follow this guide for detailed instructions to create a GKE cluster with A3 Ultra node pools, GPUDirect-RDMA, and the required GPU driver versions.

The guide uses Cluster Toolkit to create your GKE cluster quickly while incorporating best practices. The deployment includes the following:

  • Creation of the necessary VPC networks and subnets.
  • Creation of a GKE cluster with multi-networking enabled.
  • Creation of an A3 Ultra node pool with NVIDIA H200 GPUs.
  • Installation of the required components for GPUDirect-RDMA and the NCCL plugin.
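After the cluster is created, one way to confirm that the node pool exposes the GPUs you expect is to total the allocatable GPUs across nodes. The sketch below shows only the summing step on sample input. The helper name `total_gpus` and the node names are made up for illustration.

```shell
# Hypothetical helper: read "NODE_NAME GPU_COUNT" lines on stdin and print
# the total GPU count. Non-numeric counts (for example "<none>" on nodes
# without GPUs) are counted as 0.
total_gpus() {
  awk '{ if ($2 ~ /^[0-9]+$/) total += $2 } END { print total + 0 }'
}

# Sample input: two A3 Ultra nodes and one CPU-only node (made-up names).
printf '%s\n' 'gpu-node-a 8' 'gpu-node-b 8' 'cpu-node-a <none>' | total_gpus
```

Against a live cluster, and assuming kubectl already has credentials for it, you could feed the helper with `kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'`. The total should equal 8 times the number of A3 Ultra nodes.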

What's next

Once you have set up your GKE cluster with A3 Ultra node pools, you can proceed to deploy and run your benchmark recipes.

Get Help

If you encounter any issues or have questions about this setup, use one of the following resources:

  • Consult the official GKE documentation.
  • Check the issues section of this repository for known problems and solutions.
  • Reach out to Google Cloud support.