
Conversation

@weicongw
Contributor

Add support to emit metrics to the target Amazon Managed Service for Prometheus workspace

Issue #, if available:

Description of changes:

  • Add support to emit metrics to the target Amazon Managed Service for Prometheus (AMP) workspace
  • The test supports emitting metrics across accounts and regions (see the credentials sketch below)
  • If the AMP URL is not set, the test will not emit metrics
  • Emit the NCCL test average bus bandwidth metric
  • Add metadata labels to the metric
  • Add/update the README
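Below is a minimal sketch, assuming aws-sdk-go-v2, of how the role ARN passed via `--ampMetricRoleArn` could be turned into credentials for signing the remote_write calls; the function name and flow are illustrative, not necessarily the code in this change.

```go
package metrics

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

// ampCredentials (hypothetical helper) assumes the role that owns the AMP
// workspace so the test can sign remote_write requests across accounts and
// regions.
func ampCredentials(ctx context.Context, roleArn string) (aws.Credentials, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return aws.Credentials{}, err
	}
	provider := stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), roleArn)
	return provider.Retrieve(ctx)
}
```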

Test

go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia --efaEnabled=true --feature=multi-node --ampMetricUrl=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/remote_write --ampMetricRoleArn=arn:aws:iam::665181186642:role/amp
...
        [1,0]<stdout>:# Out of bounds values : 0 OK
        [1,0]<stdout>:# Avg bus bandwidth    : 3.68456 
        [1,0]<stdout>:#
        [1,0]<stdout>:
        
    mpi_test.go:145: Emitting nccl test metrics to AMP

Query the metric from AMP

export AMP_QUERY_ENDPOINT=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/query

awscurl -X POST --region us-west-2 \
--service aps "${AMP_QUERY_ENDPOINT}" \
-d 'query=nccl_average_bandwidth_gbps[60m]' \
--header 'Content-Type: application/x-www-form-urlencoded'

{"status":"success","data":{"resultType":"matrix","result":[{"metric":
{"__name__":"nccl_average_bandwidth_gbps","ami_id":"ami-0cd7612ff47454cd6",
"aws_ofi_nccl_version":"1.9.1","efa_count":"1","efa_enabled":"true",
"efa_installer_version":"1.34.0","instance_type":"p4de.24xlarge",
"kubernetes_version":"1.30+","nccl_version":"2.18.5","node_count":"2",
"nvidia_driver_version":"550.90.07","os_type":"Amazon Linux 2"},
"values":[[1726791286.534,"3.62432"],[1726794564.87,"3.68456"]]}]}}

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@weicongw marked this pull request as ready for review September 24, 2024 22:01
}

// PushMetricsToAMP pushes metric data to AWS Managed Prometheus (AMP) using SigV4 authentication
func (m *MetricManager) PushMetricsToAMP(name string, help string, value float64) error {
Contributor

  1. batching the samples would be preferable to making a separate call to the remote_write API for every sample we collect, IMO

  2. are you not able to use the upstream remote write client because of the assume-role jump? https://github.com/prometheus/prometheus/blob/5037cf75f2d4f1671ad365ba1e99902fc36808d5/storage/remote/client.go#L180

Contributor Author

For the first point, that sounds good—I’ll change it in the next revision. As for the second point, I spent some time trying to use the remote write client, but I wasn’t able to integrate it into my code.
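For reference, a minimal sketch of a batched push without the upstream client: it builds one prompb.WriteRequest for all collected samples, compresses it with snappy, and signs the single POST with SigV4 via aws-sdk-go-v2. The Sample type, helper name, and error handling are illustrative only, not the code in this PR.

```go
package metrics

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	v4 "github.com/aws/aws-sdk-go-v2/aws/signer/v4"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// Sample pairs one observed value with the labels that describe the run it came from.
type Sample struct {
	Name   string
	Value  float64
	Labels map[string]string
}

// pushSamples sends all samples in a single remote_write request instead of
// one HTTP call per sample.
func pushSamples(ctx context.Context, endpoint, region string, creds aws.Credentials, samples []Sample) error {
	var req prompb.WriteRequest
	now := time.Now().UnixMilli()
	for _, s := range samples {
		labels := []prompb.Label{{Name: "__name__", Value: s.Name}}
		for k, v := range s.Labels {
			labels = append(labels, prompb.Label{Name: k, Value: v})
		}
		req.Timeseries = append(req.Timeseries, prompb.TimeSeries{
			Labels:  labels,
			Samples: []prompb.Sample{{Value: s.Value, Timestamp: now}},
		})
	}

	// prompb types are gogo-generated and marshal themselves.
	raw, err := req.Marshal()
	if err != nil {
		return fmt.Errorf("marshal write request: %w", err)
	}
	body := snappy.Encode(nil, raw)

	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	// Sign the request for the Amazon Managed Service for Prometheus ("aps") service.
	hash := sha256.Sum256(body)
	if err := v4.NewSigner().SignHTTP(ctx, creds, httpReq, hex.EncodeToString(hash[:]), "aps", region, time.Now()); err != nil {
		return fmt.Errorf("sign request: %w", err)
	}

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("remote write failed: %s", resp.Status)
	}
	return nil
}
```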

return nil, fmt.Errorf("no nodes found in the cluster")
}

// Get instance type and metadata from the first node
Contributor

the test case shouldn't really assume that all the nodes in the cluster are the same across all these dimensions; can you pass in the dimensions with your sample, instead of fetching them ahead of time? Then you'd be able to pass dimensions that you know match the sample

Contributor Author

Updated in the latest revision
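A short hypothetical sketch of the idea: derive the dimensions from the specific node that produced the measurement, so the labels passed along with each sample are guaranteed to match it. The helper name and label keys are illustrative.

```go
package metrics

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// buildSampleLabels derives the metric dimensions from the node that ran the
// workload, rather than from whatever the first node in the cluster reports.
func buildSampleLabels(node *v1.Node, nodeCount int) map[string]string {
	return map[string]string{
		"instance_type":      node.Labels["node.kubernetes.io/instance-type"],
		"os_type":            node.Status.NodeInfo.OSImage,
		"kubernetes_version": node.Status.NodeInfo.KubeletVersion,
		"node_count":         strconv.Itoa(nodeCount),
	}
}
```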

Comment on lines 108 to 112
CMD echo "EFA Installer Version: $EFA_INSTALLER_VERSION" && \
    echo "NCCL Version: $NCCL_VERSION" && \
    echo "AWS OFI NCCL Version: $AWS_OFI_NCCL_VERSION" && \
    printf "NVIDIA Driver Version: " && \
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
Contributor

I think this would be better suited for an ENTRYPOINT script that logged this info and then ran whatever CMD was used

Contributor Author

Updated in the latest revision

"os_type": osType,
}

// Create a job to fetch the logs of meta info
Contributor

seems like you could just log these details in your actual test run instead of using a separate pod to print them

Contributor Author

I couldn't. I tried adding an ENTRYPOINT in my Dockerfile, but the NCCL test pods don't print these details.

Contributor

Wondering the same. The launcher pods or worker pods should have these details too, right? They also run the entrypoint script.

Comment on lines +26 to +30
### Enter the Kubetest2 Container

```bash
docker run --name kubetest2 -d -i -t kubetest2 /bin/sh
docker exec -it kubetest2 sh
```
Contributor

I would just build the deployer and e2e-nvidia binaries locally; that would be simpler and faster during dev.

Contributor Author

Isn't Kubetest2 the deployer?

Comment on lines 284 to 299
job := &batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "metadata-job",
		Namespace: "default",
	},
	Spec: batchv1.JobSpec{
		Template: v1.PodTemplateSpec{
			Spec: v1.PodSpec{
				RestartPolicy: v1.RestartPolicyNever,
				Containers: []v1.Container{
					{
						Name:            "metadata-job",
						Image:           *nvidiaTestImage,
						ImagePullPolicy: v1.PullAlways,
						Resources: v1.ResourceRequirements{
							Limits: v1.ResourceList{
								"nvidia.com/gpu":        node.Status.Capacity["nvidia.com/gpu"],
								"vpc.amazonaws.com/efa": node.Status.Capacity["vpc.amazonaws.com/efa"],
							},
						},
					},
				},
			},
		},
	},
}
Contributor

I think we can use a template here to reduce the function size

Contributor Author

Updated in new rev
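A minimal sketch of the template approach, assuming the manifest is kept as a YAML template and decoded with sigs.k8s.io/yaml; the template contents and parameter names are illustrative, not necessarily what the revision uses.

```go
package nvidia

import (
	"bytes"
	"text/template"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/yaml"
)

// metadataJobTemplate is a hypothetical manifest; only the image and resource
// limits are parameterized.
const metadataJobTemplate = `
apiVersion: batch/v1
kind: Job
metadata:
  name: metadata-job
  namespace: default
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: metadata-job
          image: "{{ .Image }}"
          imagePullPolicy: Always
          resources:
            limits:
              nvidia.com/gpu: "{{ .GPULimit }}"
              vpc.amazonaws.com/efa: "{{ .EFALimit }}"
`

type metadataJobParams struct {
	Image    string
	GPULimit string
	EFALimit string
}

// renderMetadataJob fills the template and decodes it into a typed Job object.
func renderMetadataJob(p metadataJobParams) (*batchv1.Job, error) {
	tmpl, err := template.New("metadata-job").Parse(metadataJobTemplate)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, p); err != nil {
		return nil, err
	}
	job := &batchv1.Job{}
	if err := yaml.Unmarshal(buf.Bytes(), job); err != nil {
		return nil, err
	}
	return job, nil
}
```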

ObjectMeta: metav1.ObjectMeta{Name: "metadata-job", Namespace: "default"},
}
err = wait.For(fwext.NewConditionExtension(cfg.Client().Resources()).JobSucceeded(job),
	wait.WithContext(ctx))
Contributor
@Pavani-Panakanti Nov 13, 2024

Can we add some comments about the purpose of this job and what it is running?

ARG EFA_INSTALLER_VERSION=latest
# Add ENV to make ARG values available at runtime
ARG EFA_INSTALLER_VERSION=1.34.0
ARG NCCL_VERSION=2.18.5
Contributor
@Pavani-Panakanti Nov 13, 2024

Why are we using an older version of NCCL here? The general recommendation is to use one of the last two releases (preferably n-1, since the latest might have issues).

)

type MetricManager struct {
// Metadata map[string]string
Contributor

Are we planning to use this field later?

@Pavani-Panakanti
Contributor

@weicongw Can we also rebase the PR? Thanks
