This repo provides a git-centric workflow to deploy Amazon SageMaker HyperPod clusters, with quality-of-life improvements and an advanced architecture.
Unsure if this repo is right for you? Not to worry; we invite you to check whether the vibe of this repo resonates with you before making an informed decision.
Differences against adt#760c0b9
- deprecate the timesync LCC. The script is still available for older clusters that need it applied in an ad-hoc fashion (and it ensures chrony does not start with a network namespace).
- hardened `setup_mariadb_accounting.sh`.
- allow ssh to compute nodes without host keys.
- enable enroot containers, but disable the CLIs for non-root users on the login and controller nodes, whose root volume may be too small for container operations. Non-root users must perform container operations (e.g., build images) on compute nodes with NVMe (see the sketch after this list).
- enable multi-user support via LDAPS. Note that there are two independent parts:
  - an example to set up an LDAPS endpoint. Ignore this when you have an existing LDAPS endpoint.
  - an LCC script to connect a cluster to an LDAPS endpoint.
- hold the Lustre client module to prevent accidental kernel upgrades
- install a mock .deb package to prevent accidental upgrades of the GPU driver
- disable and mask GDM (GNOME Display Manager).
- utility scripts for the SMHP client (bin/). Non-exhaustive highlights:
  - `dashboard-cluster-create.sh` and `dashboard-cluster-update.sh` show, side by side, the cluster creation or update status and the controller logs. Require tmux and awslogs.
  - `cluster-status.sh` can export the JSON payload returned by `aws sagemaker describe-cluster ...` into the JSON format for `cluster-config.yaml`. Useful to regenerate a `cluster-config.yaml` for another deployment.
  - `cluster-log.sh` supports a watch mode and a one-time mode. The watch mode implements retry logic to wait for LCC logs to appear in your CloudWatch log streams. Requires awslogs.
  - `show-az.sh` quickly maps an AZ name to its AZ id. Typically used when planning a cluster deployment.
- utility scripts for the cluster (src/sample-slurm-jobs/): trigger an unhealthy instance and the auto-resume of a Slurm step, probe the AMI, etc.
- other opinionated changes to shell and environment. Feel free to customize the initsmhp scripts.
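For example, a non-root user can perform container operations by first grabbing a compute node. A minimal sketch, assuming default Slurm settings and stock enroot commands; the image and container names below are placeholders:

```bash
# Get an interactive shell on a compute node (which has local NVMe for container storage).
srun -N1 --pty /bin/bash

# On the compute node: import a Docker image, create a container, and start it with enroot.
enroot import docker://ubuntu:22.04          # writes ubuntu+22.04.sqsh to the current directory
enroot create --name ubuntu2204 ubuntu+22.04.sqsh
enroot start ubuntu2204
```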
Deployment overview:
## Dependencies:
## - mandatory: awscli, boto3, jq
## - optional : awslogs, tmux
# Step 1.1. Create S3 bucket in whatever way you like
# Step 1.2. Create a self-signed certificate and LDAP auth token
# Step 1.3. Update environment variables and add bin/ to PATH.
vi profile.sh src/lcc-data/profile.sh
source profile.sh
# Step 1.4. Create VPC with at least two AZs
cfn.sh
# Step 1.5. Create FSx Lustre in whatever way you like
# Step 1.6. Create AWS Managed MS AD with LDAPS endpoint
# Step 1.7. Create SMHP cluster
vi cluster-config.yaml src/lcc-data/provisioning_parameters.json
python3 bin/validate-config.py
## Optional: customize files under src/LifecycleScripts/base-config/ and/or src/lcc-data/
# vi ...
cluster-create.sh <CLUSTER_NAME> [--profile xxxxx]
dashboard-cluster-create.sh <CLUSTER_NAME> [--profile xxxxx] # Optional
Let us now proceed to the detailed steps. Before proceeding, in case you don't wish to deploy an AD, please see the additional instructions below.
How to skip AD deployment

To skip setting up AD (and the LDAPS integration with the cluster):
- ignore Section 1.2 and Section 1.6.
- in Section 1.3, make sure that `src/lcc-data/profile.sh` sets `SMHP_LDAP_TOKEN_ARN` and `SMHP_LDAP_CERT_ARN` to blank values.
Make sure to block public access on the bucket.
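A minimal sketch with the AWS CLI, assuming a placeholder bucket name (any other method works just as well):

```bash
# Create the bucket (replace <BUCKET_NAME>; outside us-east-1, also pass --region / --create-bucket-configuration).
aws s3 mb "s3://<BUCKET_NAME>"

# Block all public access on the bucket.
aws s3api put-public-access-block \
    --bucket "<BUCKET_NAME>" \
    --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
```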
🚨🚨🚨 Skip this step when you're going to connect to your existing LDAPS 🚨🚨🚨
Now, let's follow some of the steps in the AWS ParallelCluster tutorial. You can do these steps on your computer, as long as it has the right AWS credentials to execute the `aws` CLI commands.
- Go to *Step 1: Create the AD infrastructure* / *Manual*.
- Under *Add users to the AD*: ignore steps 1-3, and jump straight to *4. Add the password to a Secrets Manager secret*. Essentially, we are deciding, before the AD even exists, what the LDAP read-only credential will be, and storing it as an AWS Secrets Manager secret. You're strongly recommended to change the example password to something else.
- Under *LDAPS with certificate verification (recommended) setup* (see the command sketch after this list):
  - *1. Generate domain certificate*, locally on your computer. REMINDER: change the domain name as needed.
  - *2. Store the certificate to Secrets Manager*, to make it retrievable from within the cluster later on.
  - *4. Import the certificate to AWS Certificate Manager (ACM)*.
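For reference, the steps above roughly correspond to the following commands. This is a hedged sketch, not the tutorial's exact commands: the domain name, file names, and secret names are placeholders, and the resulting ARNs are what you will later put into `src/lcc-data/profile.sh`.

```bash
# 1. Generate a self-signed certificate (change the domain name as needed).
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
    -keyout corp.example.com.key -out corp.example.com.crt \
    -subj "/CN=*.corp.example.com"

# 2. Store the certificate in Secrets Manager so the cluster can retrieve it later on.
aws secretsmanager create-secret \
    --name example-ldaps-cert \
    --secret-string file://corp.example.com.crt

# 4. Import the certificate into ACM (note the returned certificate ARN).
aws acm import-certificate \
    --certificate fileb://corp.example.com.crt \
    --private-key fileb://corp.example.com.key

# From "Add users to the AD", step 4: store the ReadOnlyUser password as a secret.
# Change the example password to something else!
aws secretsmanager create-secret \
    --name example-ldap-readonly-password \
    --secret-string '<READONLY_PASSWORD>'
```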
- Review and edit `profile.sh`:
  - use the bucket name created in Section 1.1
  - make sure that `SMHP_AZ_NAME` includes the AZ where your cluster will live.
- Review and edit `src/lcc-data/profile.sh`:
  - ARN of the LDAP read-only secret created in Section 1.2
  - ARN of the certificate created in Section 1.2
  - 🚨🚨🚨 If you want to skip the LDAPS integration, just set both to blank values 🚨🚨🚨

After that, `source ./profile.sh` to set the environment variables for your current shell. The remainder of this quickstart needs these env vars. REMINDER: always do this step when starting a new shell.
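A quick sanity check after sourcing, assuming the relevant variables all share the `SMHP_` prefix (true for the variables named above; the prefix assumption is mine):

```bash
# After `source ./profile.sh`, confirm the expected variables are set,
# e.g., SMHP_AZ_NAME, SMHP_LDAP_TOKEN_ARN, SMHP_LDAP_CERT_ARN.
env | grep '^SMHP_'
```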
The VPC will span two AZs, as AWS Managed Microsoft AD requires two private subnets.
# Deploy a CloudFormation template.
cfn.sh
In case you need to update the stack already deployed, please edit the necessary files (e.g., `profile.sh`, `bin/cfn.sh`, or possibly even `src/01-smhp-vpc.yaml`), then update the stack as follows:
# Update an existing CloudFormation stack.
cfn.sh update
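To confirm the deployment or update went through, you can query the stack status. The stack name below is a placeholder; use the name that `cfn.sh` actually deployed (look it up in the CloudFormation console or with `aws cloudformation list-stacks`):

```bash
# Check the status of the VPC stack (expect CREATE_COMPLETE or UPDATE_COMPLETE).
aws cloudformation describe-stacks \
    --stack-name <STACK_NAME> \
    --query 'Stacks[0].StackStatus' \
    --output text
```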
You may create an FSx Lustre filesystem using the AWS console. Make sure to select the security group from the VPC stack.
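If you prefer the CLI, here's a minimal sketch. The capacity, throughput, subnet, and security group are placeholders; pick a private subnet in the AZ where the cluster will live and the security group from the VPC stack:

```bash
# Create a persistent FSx for Lustre filesystem (adjust sizing to your needs).
aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 1200 \
    --subnet-ids <PRIVATE_SUBNET_ID> \
    --security-group-ids <FSX_SECURITY_GROUP_ID> \
    --lustre-configuration "DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=250"
```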
🚨🚨🚨 Skip this step when you're going to connect to your existing LDAPS 🚨🚨🚨
Follow the AWS ParallelCluster tutorial, but only for these specific steps.
- Create the AD either by using the Python scripts in *Step 1: Create the AD infrastructure* / *Manual*, or using the AWS console. It's important to choose two private subnets from the VPC stack you've just created in the previous step. REMINDER: change the directory and domain name as needed, and optionally other information as you like.
- Follow *Create an EC2 instance*, *Join your instance to the AD*, and *Add users to the AD*. REMINDER: make sure the `ReadOnlyUser` uses the same password as your AWS Secrets Manager secret.
- Under *LDAPS with certificate verification (recommended) setup*, do all steps except the ones you've already done under Section 1.2's third bullet (i.e., skip *1*, *2*, and *4*).
The expected outcomes of this section are:
- an AD
- an EC2 instance to configure the AD (i.e., add new users)
- a `ReadOnlyUser` which SMHP will use to connect to this AD via LDAPS, and a test user `user000` which you may skip
- an LDAPS endpoint for the AD, in the form of a Network Load Balancer with the certificate which you previously imported to ACM
- an Amazon Route 53 hosted zone (i.e., DNS records) to let the SMHP cluster resolve the LDAPS endpoint
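Once everything is in place, you can verify the LDAPS endpoint from a machine inside the VPC. A hedged sketch; the hostname, bind DN, and search base below are placeholders that depend on the directory and domain names you chose:

```bash
# Check that the LDAPS endpoint resolves and presents the expected certificate.
openssl s_client -connect ldap.corp.example.com:636 -showcerts </dev/null

# Optionally, test a read-only bind and look up the sample user (needs the ldap-utils package).
ldapsearch -x -H ldaps://ldap.corp.example.com \
    -D 'CN=ReadOnlyUser,OU=Users,OU=CORP,DC=corp,DC=example,DC=com' \
    -w '<READONLY_PASSWORD>' \
    -b 'DC=corp,DC=example,DC=com' '(sAMAccountName=user000)'
```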
- Review and edit `cluster-config.yaml`. As a rule of thumb, anything with `xxx` needs to be updated.
- Review and edit the `src/lcc-data/*` files. As a rule of thumb, anything with `xxx` needs to be updated.
- Optionally, review and edit `src/LifecycleScripts/*`. You should leave them as default unless you want to make your own customizations.
- Optionally, run `python3 bin/validate-config.py` to ensure the above configurations are sound. Note that the `.py` scripts require `boto3`.
- Now it's time to create a cluster:

cluster-create.sh <CLUSTER_NAME> [--profile xxxx]

# Optional: side-by-side displays of cluster status and controller logs. Needs awslogs and tmux.
dashboard-cluster-create.sh <CLUSTER_NAME> [--profile xxxxx]
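While the cluster is being created, you can also poll its status directly with the AWS CLI. This is shown here only as a reference; `cluster-status.sh` and the dashboard scripts presumably wrap similar calls:

```bash
# Overall cluster status (expect Creating, then InService).
aws sagemaker describe-cluster \
    --cluster-name <CLUSTER_NAME> \
    --query 'ClusterStatus' --output text

# Per-node view of the instance groups.
aws sagemaker list-cluster-nodes --cluster-name <CLUSTER_NAME>
```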
For simplicity, connect to the controller node (or login nodes, should you deploy them).
easy-ssh.sh <CLUSTER_NAME>
This should bring you to the controller node:
# Connected via SSM, so we need to switch user.
$ sudo su -l ubuntu
$ sinfo
...
You can also test out the sample AD user you created in Section 1.6:
# Connected via SSM, so we need to switch user again, this time to the sample AD user.
$ sudo su -l user000
$ whoami
user000
$ echo $HOME
/fsx/home/user000
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.