cryo: Allow GPU nodes to spawn across AZs
We already do this for GCP; now we do it on AWS too.

Fixes #3334
yuvipanda committed Nov 6, 2023
1 parent 668a272 commit a767754
Showing 3 changed files with 18 additions and 8 deletions.
7 changes: 7 additions & 0 deletions docs/howto/features/gpu.md
@@ -114,6 +114,9 @@ AWS, and we can configure a node group there to provide us GPUs.
tags+: {
"k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
},
// Allow provisioning GPUs across all AZs, to prevent a situation where all
// GPUs in a single AZ are in use and no new nodes can be spawned
availabilityZones: masterAzs,
}
```

@@ -122,6 +125,10 @@ AWS, and we can configure a node group there to provide us GPUs.
1 GPU per node. If you're using a different machine type with
more GPUs, adjust this definition accordingly.

We use the `masterAzs` variable defined earlier in the file to allow GPU nodes
to spawn in any AZ in the region, rather than just one specific zone. This
matters because a single zone can run out of GPUs fairly quickly.

2. Render the `.jsonnet` file into a `.yaml` file that `eksctl` can use

```bash
3 changes: 3 additions & 0 deletions eksctl/nasa-cryo.jsonnet
@@ -33,6 +33,9 @@ local notebookNodes = [
tags+: {
"k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1"
},
// Allow provisioning GPUs across all AZs, to prevent a situation where all
// GPUs in a single AZ are in use and no new nodes can be spawned
availabilityZones: masterAzs,
},
];

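For context, `masterAzs` is not defined in either hunk above; it is presumably a list of all the region's availability zones declared earlier in the cluster's `.jsonnet` file, while the rest of the node group stays pinned to a single AZ. Below is a minimal sketch of how the pieces fit together, using an illustrative region and instance type rather than values copied from `eksctl/nasa-cryo.jsonnet`:

```jsonnet
// Sketch only: the region and instance type are illustrative, not taken
// from the actual cluster config.
local clusterRegion = "us-west-2";
// All AZs in the region, built from the region name
local masterAzs = [clusterRegion + suffix for suffix in ["a", "b", "c"]];

[
  {
    instanceType: "g4dn.xlarge",  // 1 GPU per node
    tags+: {
      "k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu": "1",
    },
    // Allow provisioning GPUs across all AZs, to prevent a situation where all
    // GPUs in a single AZ are in use and no new nodes can be spawned
    availabilityZones: masterAzs,
  },
]
```

With `availabilityZones` set to the full list, new GPU nodes can come up in whichever of those zones still has capacity, instead of failing when a single zone is exhausted.
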
16 changes: 8 additions & 8 deletions terraform/aws/efs.tf
@@ -1,17 +1,15 @@
-// Find out which subnet and security group our EFS mount target should be in
+// Find out which subnets and security group our EFS mount target should be in
 // It needs to be in the public subnet where our nodes are, as the nodes will be
 // doing the mounting operation. It should be in a security group shared by all
-// the nodes.
-data "aws_subnet" "cluster_node_subnet" {
+// the nodes. We create a mount target in each subnet, even if we primarily put
+// all our nodes in one - this allows for GPU nodes to be spread out across
+// AZs when needed
+data "aws_subnets" "cluster_node_subnets" {

   filter {
     name   = "vpc-id"
     values = [data.aws_eks_cluster.cluster.vpc_config[0]["vpc_id"]]
   }
-  filter {
-    name   = "availability-zone"
-    values = [var.cluster_nodes_location]
-  }

   filter {
     name = "tag:aws:cloudformation:logical-id"
@@ -70,8 +68,10 @@ resource "aws_efs_file_system" "homedirs" {
 }

 resource "aws_efs_mount_target" "homedirs" {
+  for_each = toset(data.aws_subnets.cluster_node_subnets.ids)
+
   file_system_id  = aws_efs_file_system.homedirs.id
-  subnet_id       = data.aws_subnet.cluster_node_subnet.id
+  subnet_id       = each.key
   security_groups = [data.aws_security_group.cluster_nodes_shared_security_group.id]
 }

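Taken together, this change swaps the single, AZ-pinned `aws_subnet` lookup for an `aws_subnets` lookup that returns every matching subnet in the cluster's VPC, then fans the EFS mount target out over that set. The sketch below shows roughly how the relevant blocks read after the commit; the CloudFormation logical-id tag filter is truncated in the diff and so omitted here, and the EFS file system, EKS cluster, and security group objects it references are defined elsewhere in `terraform/aws/efs.tf`:

```hcl
# All subnets in the cluster's VPC (the real lookup also filters on a
# "tag:aws:cloudformation:logical-id" value that is truncated in the diff)
data "aws_subnets" "cluster_node_subnets" {
  filter {
    name   = "vpc-id"
    values = [data.aws_eks_cluster.cluster.vpc_config[0]["vpc_id"]]
  }
}

# One mount target per subnet, so nodes scheduled in any AZ (including GPU
# nodes spread across AZs) can mount the shared home directories
resource "aws_efs_mount_target" "homedirs" {
  for_each = toset(data.aws_subnets.cluster_node_subnets.ids)

  file_system_id  = aws_efs_file_system.homedirs.id
  subnet_id       = each.key
  security_groups = [data.aws_security_group.cluster_nodes_shared_security_group.id]
}
```

Keying `for_each` on the subnet IDs gives each mount target a stable address in Terraform state, so adding or removing a subnet later only creates or destroys that one mount target rather than churning the others.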
