This guidance instructs users on how to build a cloud HPC environment that aligns with the new NIST SP 800-223 standard using AWS CloudFormation and AWS ParallelCluster. The National Institute of Standards and Technology (NIST) has published NIST SP 800-223: High-Performance Computing (HPC) Security: Architecture, Threat Analysis, and Security Posture. This new standard provides guidance on how to configure and secure an HPC cluster.
Amazon Web Services (AWS) provides the most elastic and scalable cloud infrastructure to run your HPC workloads. With virtually unlimited capacity, engineers, researchers, HPC system administrators, and organizations can innovate beyond the limitations of on-premises HPC infrastructure.
High Performance Computing (HPC) on AWS removes the long wait times and lost productivity often associated with on-premises HPC clusters. Flexible HPC cluster configurations and virtually unlimited scalability allow you to grow and shrink your infrastructure as your workloads dictate, not the other way around.
This guidance provides a comprehensive approach to deploying a secure, compliant, and high-performance HPC environment on AWS. It addresses the unique security challenges of HPC systems while maintaining the performance requirements critical for computationally intensive workloads.
We developed this guidance in response to the growing need for secure HPC environments in cloud settings. Many organizations, especially those in research, engineering, and data-intensive fields, require immense computational power but struggle to balance this with stringent security and compliance requirements. The NIST SP 800-223 publication provides an excellent framework for addressing these challenges, and we wanted to demonstrate how to implement these recommendations using AWS services.
The architecture diagrams below show a sample NIST SP 800-223-based infrastructure architecture, the provisioning and deployment process using AWS CloudFormation, the HPC cluster deployment, and user interactions via AWS ParallelCluster. Depending on the Region you deploy the guidance in, it will automatically scale from 2 to 4 AZs in order to maximize the availability and redundancy of your cluster.
(1) Admin/DevOps users can deploy this architecture using a series of AWS CloudFormation templates. These templates provision networking resources, including Amazon Virtual Private Cloud (Amazon VPC) and subnets. The templates also provision resources for security and storage, such as Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. There are optional templates included to deploy a Slurm accounting database (DB) and a Microsoft Active Directory user directory.
(2) Four logical subnets (zones) are created, each in multiple Availability Zones (AZs), based on the target AWS Region. All required networking, network access control lists (ACLs), routes, and security resources are deployed. The four zones are: 1) Access Zone (public subnet), 2) Compute Zone, 3) Management Zone, and 4) Storage Zone (all private subnets).
(3) An Amazon RDS for MySQL instance is created that will be used as the Slurm accounting database. It is set up in a single AZ, or can be modified to be multi-AZ if preferred. One AWS Directory Service user directory is created across two AZs.
(4) An Amazon EFS file system is created for shared cluster storage that is mounted in all of the deployed subnets for the Storage Zone. An FSx for Lustre file system is created that is used as a highly performant scratch file system in the preferred AZ.
(5) Two Amazon S3 buckets are created: one for campaign storage using Amazon S3 Intelligent-Tiering, and one for archival storage using Amazon S3 Glacier.
(6) Random passwords are generated for both the Slurm accounting database and the Directory Service that are stored securely in AWS Secrets Manager.
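As an illustration of how those generated credentials can be retrieved later (for example, when connecting to the Slurm accounting database), the sketch below uses the AWS CLI. The secret name shown is a placeholder; look up the actual name or ARN in the Secrets Manager console or in the database stack outputs.

```bash
# Placeholder secret name -- replace with the name/ARN created by the database stack
aws secretsmanager get-secret-value \
  --secret-id nist-database-slurm-db-password \
  --query SecretString \
  --output text
```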
(1) Admin/DevOps users use the AWS ParallelCluster AWS CloudFormation stack to deploy HPC resources. Resources can reference the network, storage, security, database, and user directory from the previously launched CloudFormation stacks.
(2) The AWS ParallelCluster CloudFormation template provisions a sample cluster configuration, which includes a head node deployed in a single Availability Zone within the Management zone. It also provisions a login node deployed in a single Availability Zone within the Access zone.
(3) The Slurm workload manager is deployed on the head node and used for managing the HPC workflow processes.
(4) The sample cluster configuration included creates two Slurm queues that provision compute nodes within the Compute zone. One queue uses compute-optimized Amazon Elastic Compute Cloud (Amazon EC2) instances, while the other queue utilizes GPU-accelerated EC2 instances.
(5) Users access this guidance by establishing a connection to the deployed login node within the Access zone, using NICE DCV, SSH, or AWS Systems Manager Session Manager.
(6) Users authenticate to the login node using a username and password stored in AWS Managed Microsoft AD.
You are responsible for the cost of the AWS services deployed and used running this guidance. As of December 2024, the cost for running this guidance with the default settings in the US East (N. Virginia) region is approximately $1,156 per month.
We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.
The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the US East (N. Virginia) Region (us-east-1) for one month.
Stack Name | AWS Services | Cost [USD] |
---|---|---|
Network | VPC, Subnets, NAT Gateway, VPC Endpoints | $596.85/month |
Security | Security Groups | $0.00/month |
Storage | S3, EFS, FSx, EBS | $172.19/month |
Slurm Accounting | RDS Database | $73.84/month |
Active Directory | Managed AD (Enterprise) | $288.00/month |
Cluster | Head node, Login node | $25.00/month |
Total | | $1,155.88/month |
Note: The focus of this Guidance is to provide an example of securing the underlying AWS services and infrastructure that an HPC cluster will eventually run on. It does not aim to include any costs related to running actual HPC workloads. Please use the AWS Pricing Calculator to estimate any additional costs related to your specific HPC workload use case.
When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model reduces your operational burden because AWS operates, manages, and controls the components including the host operating system, the virtualization layer, and the physical security of the facilities in which the services operate. For more information about AWS security, visit AWS Cloud Security.
AWS ParallelCluster users can be securely authenticated and authorized using AWS Managed Microsoft AD. HPC cluster EC2 components are deployed into a Virtual Private Cloud (VPC), which provides additional network security isolation for all contained components. The login node is deployed into a public subnet and is available for access via secure connections (SSH and SSM). The head node is deployed into a private subnet and is also available via secure connections (SSH and SSM). Compute nodes are deployed into a private subnet and are managed from the head node via the Slurm workload manager, and the Slurm accounting database is deployed into a private subnet and managed from the head node using Slurm. Data stored in Amazon S3, Amazon EFS, and Amazon FSx for Lustre is encrypted at rest and in transit. Access to other AWS services from AWS ParallelCluster components is secured over VPC endpoints from a private management subnet.
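If you want to spot-check some of these controls after deployment, the AWS CLI sketch below lists the encryption status of EFS file systems and the VPC endpoints in the cluster VPC. The VPC ID is a placeholder for the value exported by the network stack.

```bash
# Confirm the EFS file systems report Encrypted: true
aws efs describe-file-systems \
  --query 'FileSystems[*].{Id:FileSystemId,Encrypted:Encrypted}' --output table

# List the VPC endpoints available from the private management subnet
# (replace the VPC ID with the value exported by the network stack)
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'VpcEndpoints[*].ServiceName' --output table
```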
See CONTRIBUTING for more information.
If you prefer to use SSH to access the login node or head node, you will need to create a new SSH key pair in your account before launching the ParallelCluster CloudFormation template.
To do that, please follow the steps below:
- Login to your AWS account
- In the search bar at the top of the screen type in EC2
- In the list of services select EC2
- In the left-hand menu select Key Pairs under the Network & Security section
- Click Create key pair
- Enter a key pair name
- Select your preferred key pair type and format and click Create key pair
- This will automatically start a download of the private key for the key pair you just created
- Save this key in a secure location (this key can act as your password to login to the nodes launched by this template)
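If you prefer the command line, an equivalent sketch using the AWS CLI is shown below; the key name is only an example. It creates the key pair and saves the private key with the permissions SSH expects.

```bash
# Key name is an example -- choose any unique name in your account/Region
aws ec2 create-key-pair \
  --key-name hpc-login-key \
  --query 'KeyMaterial' \
  --output text > hpc-login-key.pem
chmod 400 hpc-login-key.pem
```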
This deployment requires that you have access to AWS CloudFormation in your AWS account, with permissions to create the following resources:
AWS Services in this Guidance:
- Amazon VPC
- Amazon CloudWatch
- AWS Identity and Access Management (IAM)
- Amazon Elastic Compute Cloud (EC2)
- Amazon Elastic File System (EFS)
- Amazon Elastic Block Store (EBS)
- Amazon FSx for Lustre (FSxL)
- Amazon Relational Database Service (RDS)
- AWS Directory Service
- AWS Secrets Manager
- AWS Systems Manager
- All services used by AWS ParallelCluster
- Clone the sample code repository:
git clone https://github.com/aws-solutions-library-samples/guidance-for-building-nist-sp-800-223-hpc-on-aws.git
- Change directory to the deployment folder inside the repo
cd guidance-for-building-nist-sp-800-223-hpc-on-aws/deployment
- Locate the six AWS CloudFormation templates and review them in order in a text editor of your choice or in the AWS CloudFormation console
- In most cases you will want to use the default settings; however, you have the ability to modify these templates to your specific needs. Below are the default stack source file names with their corresponding CloudFormation stack names:
Template File Name | Default Stack Name |
---|---|
0_network.yaml | nist-network |
1_security.yaml | nist-security |
2_storage.yaml | nist-storage |
3_slurm_db.yaml | nist-database |
4_active_directory.yaml | nist-ad |
5_pcluster.yaml | nist-hpc |
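The console steps below walk through deploying each template. As a rough command-line alternative, a stack can also be created with the AWS CLI, as sketched here for the network template. Parameter overrides are omitted; later stacks would need their parameters (such as the previously deployed stack names) passed via --parameters.

```bash
# Deploy the network stack under its default name; repeat for the remaining
# templates in numeric order once each stack reaches CREATE_COMPLETE
aws cloudformation create-stack \
  --stack-name nist-network \
  --template-body file://0_network.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
```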
- Open a browser and log in to your AWS account
- In the search bar at the top of your screen, type in CloudFormation
- When presented with a list of services, click CloudFormation to open the CloudFormation console
- Click the Create Stack button
- In the Prepare template section, select Choose an existing template
- In the Specify template section, select Upload a template file
- Click the Choose file button
- Navigate to the location on your local computer where you cloned the sample code and navigate into the deployment folder. There you will find the CloudFormation templates prefaced with a number that indicates the order to execute them in.
- Select the first template, titled 0_network.yaml
- For each template you will be asked to provide a Stack name - this name must be a unique stack name for the region you are deploying in.
Important: The stack name should be noted for use in deployment of other templates. Downstream services will need to know this stack name in order to reference Amazon Resource Names (ARNs) or resource IDs that will be exported/output for each template
- For the Network stack, review the parameters and adjust as needed based on your specific use case or requirements
- Once you have reviewed and validated the parameters, click the Next button at the bottom of the page
- Leave the default options on the Configure stack options page
- You will need to scroll to the bottom of this page and select the check box to allow CloudFormation to create IAM resources on your behalf
- Click Next
- On the Review and create screen, review your selections again and then click the Submit button at the bottom of the page
- Your CloudFormation stack will begin deploying
- You can monitor the deployment progress in the AWS CloudFormation console
- Wait until you see the stack status update from "CREATE_IN_PROGRESS" to "CREATE_COMPLETE" before moving on to deploying the next template
- You can review the outputs generated by the stack by going to the Outputs tab for each stack, or by going to the Exports page in the left-hand menu; a command-line alternative is sketched below the screenshots
- Note: The export values will be used by later templates to reference resources created in earlier templates
Stack Outputs View:
Stack Exports View:
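As sketched below (assuming the default nist-network stack name), the same status, outputs, and exports can also be inspected from the AWS CLI:

```bash
# Block until stack creation finishes, then print its outputs and the exports
# that later stacks reference
aws cloudformation wait stack-create-complete --stack-name nist-network
aws cloudformation describe-stacks \
  --stack-name nist-network \
  --query 'Stacks[0].Outputs' --output table
aws cloudformation list-exports --query 'Exports[*].[Name,Value]' --output table
```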
- Repeat the steps above, starting with step 7, for the rest of the guidance stacks, moving on to the next stack source file in the deployment folder (listed in Step 3 above)
Important: Stacks 1-5 have input parameters that ask for the previously deployed stack names. If you modify the stack names from their default values in Step 3, you will need to also update the parameters in each subsequent stack with the appropriate name so that the relevant services can be referenced.
Below is an example of modified input parameters used for deployment of the hpc-pc-cluster stack:
Note: The storage, Slurm database, Active Directory, and AWS ParallelCluster stacks are intended to be simple examples for testing the NIST SP 800-223 reference architecture. For more production-ready versions of these templates, please see the AWS HPC Recipes repository.
- Open the CloudFormation console and verify the status of the guidance templates with the names listed in Step 3.
- Make sure that all CloudFormation stacks are deployed successfully, with resulting status of "CREATE_COMPLETE"
You can also open the nested stack whose name starts with c-nist and look at its Outputs tab to get the values of some important HPC cluster parameters:
If you need to verify that specified high performance FSx storage was provisioned, navigate to the FSx section of AWS Console and look for the FSx for Lustre entry like:
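Alternatively, a quick AWS CLI check (a sketch; it simply lists all Lustre file systems in the Region) shows each file system's ID, lifecycle state, and storage capacity:

```bash
# List FSx for Lustre file systems with their lifecycle state and capacity (GiB)
aws fsx describe-file-systems \
  --query 'FileSystems[?FileSystemType==`LUSTRE`].[FileSystemId,Lifecycle,StorageCapacity]' \
  --output table
```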
Other services/resources deployed by the guidance can also be obtained from the outputs of the respective CloudFormation stacks. You have now successfully deployed the infrastructure needed to comply with the guidelines and recommendations outlined in NIST SP 800-223.
You can begin using the cluster by logging in to the head node to review or modify any of the Slurm settings. You can use SSM to securely open a terminal session to either the login node or the head node by following the instructions below:
- In the search bar above, type in EC2
- In the list of services, select EC2
- In the left-hand menu, select Instances
- Locate either the head node or the login node and select that instance by checking the box to the left of it
- Locate the Connect button near the top of the screen
- In the window that opens, click the Session Manager tab
- Click the Connect button to open a secure terminal session in your browser
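If you would rather open the session from a local terminal instead of the browser, an equivalent sketch with the AWS CLI is shown below. It requires the Session Manager plugin for the AWS CLI, and the instance ID is a placeholder for the login or head node ID shown in the EC2 console.

```bash
# Instance ID is a placeholder -- copy the real ID of the login or head node
aws ssm start-session --target i-0123456789abcdef0
```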
Alternatively, when you launch the 5_pcluster.yaml CloudFormation template, you can select an SSH key pair that already exists in your AWS account. If you completed the prerequisite steps to create a key pair, you will see it populated in this list.
- Locate your SSH key pair
- Ensure you have the proper permissions set on the key pair (read-only access)
chmod 400 /path/key/ssh_key.pem
- In the list of services, select EC2
- In the left-hand menu, select Instances
- Locate either the head node or the login node and select one instance by checking the box to the left of the instance
- Locate the Connect button near the top of the screen
- In the window that opens, click the SSH client tab
- Follow the instructions on the screen to log in to your instance
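As a sketch, the SSH command typically looks like the following; the host name is a placeholder for the login node address shown on the SSH client tab, and the key is the one created in the prerequisites.

```bash
# Host name is a placeholder -- use the value shown on the instance's "SSH client" tab
ssh -i /path/key/ssh_key.pem ec2-user@ec2-11-22-33-44.compute-1.amazonaws.com
```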
You can follow the steps in the AWS documentation to run a simple job using the Slurm workload scheduler.
You need to be using the ec2-user account to get access to all pre-installed HPC utilities, which can be achieved by switching the shell user:
sudo su ec2-user
After that, you can run the commands below to check the HPC cluster nodes/capacity and compute module availability, and to create a test Hello World job:
sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu* up infinite 2 idle~ cpu-dy-cpu-[1-2]
gpu up infinite 2 idle~ gpu-dy-gpu-[1-2]
[ec2-user@ip-10-0-2-161 ~]$ !3
module avail
------------------------------------------------------------------------ /opt/amazon/modules/modulefiles -------------------------------------------------------------------------
libfabric-aws/1.22.0amzn1.0 openmpi/4.1.6 openmpi5/5.0.3
------------------------------------------------------------------------- /usr/share/Modules/modulefiles -------------------------------------------------------------------------
dot module-git module-info modules null use.own
-------------------------------------------------------------------- /opt/intel/mpi/2021.13/etc/modulefiles/ ---------------------------------------------------------------------
intelmpi/2021.13
[ec2-user@ip-10-0-2-161 ~]$ cat hellojob.sh
#!/bin/bash
sleep 30
echo "Hello World from $(hostname)"
[ec2-user@ip-10-0-2-161 ~]$ sbatch ./hellojob.sh
Submitted batch job 1
[ec2-user@ip-10-0-2-161 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 cpu hellojob ec2-user CF 0:08 1 cpu-dy-cpu-1
[ec2-user@ip-10-0-2-161 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 cpu hellojob ec2-user CF 0:26 1 cpu-dy-cpu-1
[ec2-user@ip-10-0-2-161 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 cpu hellojob ec2-user CF 0:28 1 cpu-dy-cpu-1
[ec2-user@ip-10-0-2-161 ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 cpu hellojob ec2-user CF 0:33 1 cpu-dy-cpu-1
Slurm will launch a properly sized compute node (EC2 instance) to run the scheduled job. Wait until the job completes (no output from the squeue command), then check the completed job output:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
$ls -l
total 8
-rwxrwxr-x 1 ec2-user ec2-user 57 Dec 16 23:43 hellojob.sh
-rw-rw-r-- 1 ec2-user ec2-user 32 Dec 16 23:47 slurm-1.out
[ec2-user@ip-10-0-2-161 ~]$ cat slurm-1.out
Hello World from ip-10-0-54-243
This confirms that your provisioned HPC cluster can run a compute job on its node(s) controlled by Slurm.
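Building on that, a minimal multi-node batch script sketch is shown below. It assumes you have compiled your own MPI binary (mpi_hello here is a placeholder) and uses the cpu partition and the openmpi/4.1.6 module listed in the output above.

```bash
#!/bin/bash
#SBATCH --job-name=mpi-hello
#SBATCH --partition=cpu          # CPU queue created by the sample cluster config
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

# Load one of the MPI stacks reported by "module avail" and launch across nodes.
# "mpi_hello" is a placeholder for your own compiled MPI executable.
module load openmpi/4.1.6
srun ./mpi_hello
```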
When you no longer need to use the guidance, you should delete the AWS resources deployed by it in order to prevent ongoing charges for their use.
- In the AWS Management Console, navigate to CloudFormation and locate the 6 guidance stacks deployed.
- Starting with the most recent stack (not including any nested stacks), select the stack and click the Delete button
- Confirm your intent by clicking the Delete button on the pop-up screen
- Repeat this for each of the 6 CloudFormation stacks deployed to remove all resources from your account, and confirm that all stacks get deleted.
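If you kept the default stack names, a command-line sketch of the same cleanup (deleting in reverse order of creation and waiting for each delete to finish) looks like this:

```bash
# Delete the guidance stacks in reverse order of creation
for stack in nist-hpc nist-ad nist-database nist-storage nist-security nist-network; do
  aws cloudformation delete-stack --stack-name "$stack"
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
done
```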
Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.
Chris Riddle - Sr. Solutions Architect AWS
Daniel Zilberman - Sr. Solutions Architect AWS
This library is licensed under the MIT-0 License. See the LICENSE file.