
Conversation

@weicongw
Contributor

Add support to emit metrics to the target Amazon Managed Service for Prometheus workspace

Issue #, if available:

Description of changes:

  • Add support to emit metrics to the target Amazon Managed Service for Prometheus (AMP) workspace
  • The test supports emitting metrics across accounts and regions (see the credentials sketch below)
  • If the AMP URL is not set, the test will not emit metrics
  • Emit the NCCL test average bus bandwidth metric
  • Add metadata labels to the metric
  • Add/update the README
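Below is a minimal sketch, assuming aws-sdk-go-v2, of how the role ARN passed via `--ampMetricRoleArn` could be turned into credentials for signing the remote_write calls; the function name and flow are illustrative, not necessarily the code in this change.

```go
package metrics

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials/stscreds"
	"github.com/aws/aws-sdk-go-v2/service/sts"
)

// ampCredentials (hypothetical helper) assumes the role that owns the AMP
// workspace so the test can sign remote_write requests across accounts and
// regions.
func ampCredentials(ctx context.Context, roleArn string) (aws.Credentials, error) {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return aws.Credentials{}, err
	}
	provider := stscreds.NewAssumeRoleProvider(sts.NewFromConfig(cfg), roleArn)
	return provider.Retrieve(ctx)
}
```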

Test

go test -timeout 60m -v . -args -nvidiaTestImage public.ecr.aws/o5d5x8n6/weicongw:nvidia --efaEnabled=true --feature=multi-node --ampMetricUrl=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/remote_write --ampMetricRoleArn=arn:aws:iam::665181186642:role/amp
...
        [1,0]<stdout>:# Out of bounds values : 0 OK
        [1,0]<stdout>:# Avg bus bandwidth    : 3.68456 
        [1,0]<stdout>:#
        [1,0]<stdout>:
        
    mpi_test.go:145: Emitting nccl test metrics to AMP

Query the metric from AMP

export AMP_QUERY_ENDPOINT=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-9f8fe538-f707-46e7-863c-26bfb192dc52/api/v1/query

awscurl -X POST --region us-west-2 \
--service aps "${AMP_QUERY_ENDPOINT}" \
-d 'query=nccl_average_bandwidth_gbps[60m]' \
--header 'Content-Type: application/x-www-form-urlencoded'

{"status":"success","data":{"resultType":"matrix","result":[{"metric":
{"__name__":"nccl_average_bandwidth_gbps","ami_id":"ami-0cd7612ff47454cd6",
"aws_ofi_nccl_version":"1.9.1","efa_count":"1","efa_enabled":"true",
"efa_installer_version":"1.34.0","instance_type":"p4de.24xlarge",
"kubernetes_version":"1.30+","nccl_version":"2.18.5","node_count":"2",
"nvidia_driver_version":"550.90.07","os_type":"Amazon Linux 2"},
"values":[[1726791286.534,"3.62432"],[1726794564.87,"3.68456"]]}]}}

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@weicongw marked this pull request as ready for review September 24, 2024 22:01
}

// PushMetricsToAMP pushes metric data to AWS Managed Prometheus (AMP) using SigV4 authentication
func (m *MetricManager) PushMetricsToAMP(name string, help string, value float64) error {
Contributor

  1. batching the samples would be preferable to making a separate call to the remote_write API for every sample we collect, IMO

  2. are you not able to use the upstream remote write client because of the assume-role jump? https://github.com/prometheus/prometheus/blob/5037cf75f2d4f1671ad365ba1e99902fc36808d5/storage/remote/client.go#L180

Contributor Author

For the first point, that sounds good—I’ll change it in the next revision. As for the second point, I spent some time trying to use the remote write client, but I wasn’t able to integrate it into my code.
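For reference, a minimal sketch of a batched push without the upstream client: it builds one prompb.WriteRequest for all collected samples, compresses it with snappy, and signs the single POST with SigV4 via aws-sdk-go-v2. The Sample type, helper name, and error handling are illustrative only, not the code in this PR.

```go
package metrics

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	v4 "github.com/aws/aws-sdk-go-v2/aws/signer/v4"
	"github.com/golang/snappy"
	"github.com/prometheus/prometheus/prompb"
)

// Sample pairs one observed value with the labels that describe the run it came from.
type Sample struct {
	Name   string
	Value  float64
	Labels map[string]string
}

// pushSamples sends all samples in a single remote_write request instead of
// one HTTP call per sample.
func pushSamples(ctx context.Context, endpoint, region string, creds aws.Credentials, samples []Sample) error {
	var req prompb.WriteRequest
	now := time.Now().UnixMilli()
	for _, s := range samples {
		labels := []prompb.Label{{Name: "__name__", Value: s.Name}}
		for k, v := range s.Labels {
			labels = append(labels, prompb.Label{Name: k, Value: v})
		}
		req.Timeseries = append(req.Timeseries, prompb.TimeSeries{
			Labels:  labels,
			Samples: []prompb.Sample{{Value: s.Value, Timestamp: now}},
		})
	}

	// prompb types are gogo-generated and marshal themselves.
	raw, err := req.Marshal()
	if err != nil {
		return fmt.Errorf("marshal write request: %w", err)
	}
	body := snappy.Encode(nil, raw)

	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Encoding", "snappy")
	httpReq.Header.Set("Content-Type", "application/x-protobuf")
	httpReq.Header.Set("X-Prometheus-Remote-Write-Version", "0.1.0")

	// Sign the request for the Amazon Managed Service for Prometheus ("aps") service.
	hash := sha256.Sum256(body)
	if err := v4.NewSigner().SignHTTP(ctx, creds, httpReq, hex.EncodeToString(hash[:]), "aps", region, time.Now()); err != nil {
		return fmt.Errorf("sign request: %w", err)
	}

	resp, err := http.DefaultClient.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode/100 != 2 {
		return fmt.Errorf("remote write failed: %s", resp.Status)
	}
	return nil
}
```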

return nil, fmt.Errorf("no nodes found in the cluster")
}

// Get instance type and metadata from the first node
Contributor

the test case shouldn't really assume that all the nodes in the cluster are the same across all these dimensions; can you pass in the dimensions with your sample, instead of fetching them ahead of time? Then you'd be able to pass dimensions that you know match the sample

Contributor Author

Updated in the latest revision
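A short hypothetical sketch of the idea: derive the dimensions from the specific node that produced the measurement, so the labels passed along with each sample are guaranteed to match it. The helper name and label keys are illustrative.

```go
package metrics

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// buildSampleLabels derives the metric dimensions from the node that ran the
// workload, rather than from whatever the first node in the cluster reports.
func buildSampleLabels(node *v1.Node, nodeCount int) map[string]string {
	return map[string]string{
		"instance_type":      node.Labels["node.kubernetes.io/instance-type"],
		"os_type":            node.Status.NodeInfo.OSImage,
		"kubernetes_version": node.Status.NodeInfo.KubeletVersion,
		"node_count":         strconv.Itoa(nodeCount),
	}
}
```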

Comment on lines 108 to 112
CMD echo "EFA Installer Version: $EFA_INSTALLER_VERSION" && \
    echo "NCCL Version: $NCCL_VERSION" && \
    echo "AWS OFI NCCL Version: $AWS_OFI_NCCL_VERSION" && \
    printf "NVIDIA Driver Version: " && \
    nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
Contributor

I think this would be better suited for an ENTRYPOINT script that logged this info and then ran whatever CMD was used

Contributor Author

Updated in the latest revision

"os_type": osType,
}

// Create a job to fetch the logs of meta info
Contributor

seems like you could just log these details in your actual test run instead of using a separate pod to print them

Contributor Author

I couldn't. I tried adding an ENTRYPOINT in my Dockerfile, but the NCCL test pods don't print these details.

Contributor

Wondering the same. The launcher pods or worker pods should have these details too, right? They also run the entrypoint script.

Comment on lines +26 to +30
### Enter the Kubetest2 Container

```bash
docker run --name kubetest2 -d -i -t kubetest2 /bin/sh
docker exec -it kubetest2 sh
```
Contributor

I would just build the deployer and e2e-nvidia binaries locally; that would be simpler and faster during dev.

Contributor Author

Isn't Kubetest2 the deployer?

Comment on lines 284 to 299
job := &batchv1.Job{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "metadata-job",
		Namespace: "default",
	},
	Spec: batchv1.JobSpec{
		Template: v1.PodTemplateSpec{
			Spec: v1.PodSpec{
				RestartPolicy: v1.RestartPolicyNever,
				Containers: []v1.Container{
					{
						Name:            "metadata-job",
						Image:           *nvidiaTestImage,
						ImagePullPolicy: v1.PullAlways,
						Resources: v1.ResourceRequirements{
							Limits: v1.ResourceList{
								"nvidia.com/gpu":        node.Status.Capacity["nvidia.com/gpu"],
								"vpc.amazonaws.com/efa": node.Status.Capacity["vpc.amazonaws.com/efa"],
							},
						},
					},
				},
			},
		},
	},
}
Contributor

I think we can use a template here to reduce the function size

Contributor Author

Updated in new rev
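A minimal sketch of the template approach, assuming the manifest is kept as a YAML template and decoded with sigs.k8s.io/yaml; the template contents and parameter names are illustrative, not necessarily what the revision uses.

```go
package nvidia

import (
	"bytes"
	"text/template"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/yaml"
)

// metadataJobTemplate is a hypothetical manifest; only the image and resource
// limits are parameterized.
const metadataJobTemplate = `
apiVersion: batch/v1
kind: Job
metadata:
  name: metadata-job
  namespace: default
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: metadata-job
          image: "{{ .Image }}"
          imagePullPolicy: Always
          resources:
            limits:
              nvidia.com/gpu: "{{ .GPULimit }}"
              vpc.amazonaws.com/efa: "{{ .EFALimit }}"
`

type metadataJobParams struct {
	Image    string
	GPULimit string
	EFALimit string
}

// renderMetadataJob fills the template and decodes it into a typed Job object.
func renderMetadataJob(p metadataJobParams) (*batchv1.Job, error) {
	tmpl, err := template.New("metadata-job").Parse(metadataJobTemplate)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, p); err != nil {
		return nil, err
	}
	job := &batchv1.Job{}
	if err := yaml.Unmarshal(buf.Bytes(), job); err != nil {
		return nil, err
	}
	return job, nil
}
```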

ObjectMeta: metav1.ObjectMeta{Name: "metadata-job", Namespace: "default"},
}
err = wait.For(fwext.NewConditionExtension(cfg.Client().Resources()).JobSucceeded(job),
	wait.WithContext(ctx))
Contributor
@Pavani-Panakanti Nov 13, 2024

Can we add some comments about the purpose of this job and what it is running?

ARG EFA_INSTALLER_VERSION=latest
# Add ENV to make ARG values available at runtime
ARG EFA_INSTALLER_VERSION=1.34.0
ARG NCCL_VERSION=2.18.5
Contributor
@Pavani-Panakanti Nov 13, 2024

Why are we using an older version of NCCL here? The general recommendation is to use one of the last two releases (preferably n-1, since the latest might have issues).

)

type MetricManager struct {
// Metadata map[string]string
Contributor

Are we planning to use this field later?

@Pavani-Panakanti
Contributor

@weicongw Can we also rebase the PR? Thanks
