This repo includes:
- Documentation covering core Nomad concepts: server (control plane) / client (worker) architecture, jobs, tasks and allocations
- A terraform config for provisioning a 3 client / 3 server Nomad cluster in AWS
- Some sample jobs
Nomad is packaged as a single executable. It is written in Go and generally runs anywhere the Linux operating system is supported, including IBM s390x-based mainframes.
A Nomad cluster consists of two main elements:
- Server nodes, which make up the control plane
- Client nodes, the workers on which orchestrated jobs run

Clusters can be multi-region, and client nodes can be grouped into node pools (see the sketch below).
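As an illustration, a client joins a pool via its agent configuration and a job can then target that pool. A minimal sketch, assuming a hypothetical pool name of `batch-workers`:

```hcl
# Client agent configuration fragment (e.g. /etc/nomad.d/client.hcl);
# the pool name "batch-workers" is hypothetical.
client {
  enabled   = true
  node_pool = "batch-workers"
}

# Job specification fragment that targets the same pool.
job "nightly-report" {
  node_pool = "batch-workers"
  # ...
}
```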
The gossip protocol plays a key role in cluster node membership.
Users interact with Nomad clusters via jobs, which in turn encapsulate other constructs including tasks. There are a variety of ways to deploy jobs to a cluster and manage them, including the Nomad CLI, the HTTP API and the web UI.
Nomad comes with an ACL system, and node-to-node communications can be secured with TLS.
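A minimal sketch of the agent configuration stanzas involved, assuming illustrative certificate paths (the file names mirror the defaults produced by the `nomad tls` commands used later in this walkthrough):

```hcl
# Agent configuration fragment: enable ACLs and mutual TLS.
acl {
  enabled = true
}

tls {
  http = true
  rpc  = true

  # Paths are illustrative; adjust to wherever the certificates are installed.
  ca_file   = "/etc/nomad.d/tls/nomad-agent-ca.pem"
  cert_file = "/etc/nomad.d/tls/global-server-nomad.pem"
  key_file  = "/etc/nomad.d/tls/global-server-nomad-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}
```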
A key differentiator between Nomad and other orchestrators such as Kubernetes is that Nomad can orchestrate a wide variety of job types via task drivers. Simply put, if a task driver exists for a schedulable entity, Nomad can orchestrate that entity. HashiCorp provides first-party supported task drivers, and the ecosystem also includes community-written task drivers.
The raw_exec task driver provides shell-out-like capabilities for running jobs, but it should be used with caution: any job run under this driver runs as the same user as the Nomad agent itself, so the isolated exec driver should generally be preferred. A task sketch contrasting the two drivers follows.
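A minimal sketch of the two driver choices inside a task; the task names and the command shown are arbitrary:

```hcl
# Preferred: the exec driver isolates the task using kernel features
# such as cgroups and namespaces.
task "report" {
  driver = "exec"

  config {
    command = "/bin/date"
  }
}

# Use with caution: raw_exec runs the command with the same privileges
# as the Nomad agent (often root).
task "report-raw" {
  driver = "raw_exec"

  config {
    command = "/bin/date"
  }
}
```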
A Nomad job consists of a number of key elements; an example, rendered in Nomad HCL, follows this list:
- region: regions are defined at the server configuration level.
- datacenters: specifies the data centers in the region that the job is to be spread over.
- type: specifies the type of job; jobs intended to run indefinitely specify a type of service, as per the example.
- group: acts as a container for specifying which tasks should be executed on the same client; this is analogous to a pod in Kubernetes parlance.
- task: the finest-grained atomic unit of work Nomad can execute.
- task driver: used by Nomad clients to execute a task and provide resource isolation.
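A minimal illustrative service job; the job, group and task names, the Docker image and the resource values are placeholders rather than anything shipped in this repo:

```hcl
job "http-echo" {
  region      = "global"
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 1

    task "server" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-listen", ":8080", "-text", "hello from Nomad"]
      }

      resources {
        cpu    = 100 # MHz
        memory = 128 # MB
      }
    }
  }
}
```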
Full documentation on the complete set of job specification options can be found in the [job specification documentation](https://developer.hashicorp.com/nomad/docs/job-specification).
By default Nomad uses a bin-packing algorithm to schedule jobs; however, specific client nodes can be targeted via the affinity stanza, and allocations can be spread across data centers via the spread stanza. Nomad 1.7 also introduces NUMA-aware scheduling (Enterprise edition), which is useful for latency-sensitive use cases such as low latency trading. An allocation is a core concept linked to scheduling in Nomad: allocations map the tasks in a job to clients.
Refer to the Nomad documentation on [scheduling](https://developer.hashicorp.com/nomad/docs/concepts/scheduling/scheduling) for further information on this topic.
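A sketch of the affinity and spread stanzas at the job level; the node class value is hypothetical:

```hcl
job "placement-example" {
  datacenters = ["dc1", "dc2"]

  # Prefer, but do not require, clients with a particular node class.
  affinity {
    attribute = "${node.class}"
    value     = "high-memory"
    weight    = 50
  }

  # Spread allocations evenly across the data centers.
  spread {
    attribute = "${node.datacenter}"
    weight    = 100
  }
}
```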
Nomad 1.7 introduced support for workload identities. Simply put, a JWT is generated that is unique for the allocation the job runs in.
The primary use case for workload identity is to allow Nomad workloads to authenticate with third parties via OIDC (including Vault and Consul).
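A minimal sketch of requesting workload identities inside a task; the identity name and audience value are hypothetical:

```hcl
task "app" {
  driver = "docker"

  # Expose the default workload identity JWT to the task.
  identity {
    env  = true # inject the JWT as an environment variable
    file = true # write the JWT into the task's secrets directory
  }

  # An additional named identity intended for a third party.
  identity {
    name = "example-oidc"
    aud  = ["example.io"]
  }
}
```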
- Clone this repo:
$ git clone https://github.com/ChrisAdkin8/Nomad-101-Demo.git
- cd into the Nomad-101-Demo/terraform directory.
- Open the terraform.tfvars file and assign:
  - an AMI id to the ami variable; the default in the file is for Ubuntu 22.04 in the us-east-1 region, so leave this as is if that is the region being deployed to, otherwise change it as appropriate
  - a gossip encryption key to nomad_gossip_key (for example, the string generated by nomad operator gossip keyring generate)
  - the Nomad Enterprise license to nomad_license (only if using the Enterprise version)
  - uncomment the Nomad Enterprise / Nomad OSS blocks as appropriate (an illustrative terraform.tfvars fragment follows this list)
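An illustrative terraform.tfvars fragment; the values are placeholders and the variable names are taken from the steps above, so the file shipped in the repo remains authoritative:

```hcl
# terraform.tfvars (illustrative values only)
ami              = "ami-0123456789abcdef0"       # Ubuntu 22.04 AMI for the target region
nomad_gossip_key = "<generated gossip key>"      # output of the gossip key generation command
nomad_license    = "<Nomad Enterprise license>"  # only if using the Enterprise version
```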
- Change directory to the certificates ca directory:
$ cd terraform/certificates/ca
- Create the TLS CA private key and certificate:
$ nomad tls ca create
- Create the nomad server private key and certificate and move them to the servers directory:
$ nomad tls cert create -server -region global
$ mv *server*.pem ../servers/.
- Create the nomad client private key and certificate and move them to the clients directory:
$ nomad tls cert create -client
$ mv *client*.pem ../clients/.
- Create the nomad cli private key and certificate and move them to the cli directory:
$ nomad tls cert create -cli
$ mv *cli*.pem ../cli/.
- Change directory back to the Nomad-101-Demo/terraform directory:
$ cd ../..
- Set the environment variables that allow terraform to connect to your AWS account:
export AWS_ACCESS_KEY_ID=<your AWS access key ID>
export AWS_SECRET_ACCESS_KEY=<your AWS secret access key>
export AWS_SESSION_TOKEN=<your AWS session token>
- Install the provider plugins required by the configuration:
$ terraform init
- Apply the configuration; this will result in the creation of the new resources (29 in the example output below):
$ terraform apply -auto-approve
- The tail of the terraform apply output should look something like this:
Apply complete! Resources: 29 added, 0 changed, 0 destroyed.
Outputs:
IP_Addresses = <<EOT
Nomad Cluster installed
SSH default user: ubuntu
Server public IPs: 54.172.43.18, 18.212.218.138, 184.72.134.0
Client public IPs: 54.167.92.93, 54.80.76.185, 52.73.202.229
If ACL is enabled:
To get the nomad bootstrap token, run the following on the leader server
export NOMAD_TOKEN=$(cat /home/ubuntu/nomad_bootstrap)
EOT
lb_address_consul_nomad = "http://54.172.43.18:4646"
- SSH access to the Nomad cluster client and server EC2 instances can be achieved via:
$ ssh -i certs/id_rsa.pem ubuntu@<client/server IP address>
- Once SSH'd into one of the EC2 instances, check that the nomad systemd unit is in a healthy state. Note that, depending on the EC2 instance you SSH onto, that instance may or may not be the current cluster leader:
$ systemctl status nomad
● nomad.service - Nomad
Loaded: loaded (/lib/systemd/system/nomad.service; disabled; vendor preset: enabled)
Active: active (running) since Mon 2024-01-08 11:42:16 UTC; 2min 3s ago
Docs: https://nomadproject.io/docs/
Main PID: 5617 (nomad)
Tasks: 7
Memory: 86.4M
CPU: 2.706s
CGroup: /system.slice/nomad.service
└─5617 /usr/bin/nomad agent -config /etc/nomad.d
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.543Z [INFO] nomad.raft: entering leader state: leader="Node at 172.31.206.75:4647 [Leader]"
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.543Z [INFO] nomad.raft: added peer, starting replication: peer=575c8e14-e841-7b67-7e72-8679b0632aae
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.543Z [INFO] nomad.raft: added peer, starting replication: peer=44b7d1e8-8c04-c33f-e1ab-ca843c4d5567
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.543Z [INFO] nomad: cluster leadership acquired
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.544Z [INFO] nomad.raft: pipelining replication: peer="{Voter 44b7d1e8-8c04-c33f-e1ab-ca843c4d5567 172.31.74.132:4647}"
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.547Z [INFO] nomad.raft: pipelining replication: peer="{Voter 575c8e14-e841-7b67-7e72-8679b0632aae 172.31.81.190:4647}"
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.578Z [INFO] nomad.core: established cluster id: cluster_id=98469698-6731-35c2-682e-02e6e76d8aed create_time=1704714145567062938
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.578Z [INFO] nomad: eval broker status modified: paused=false
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.578Z [INFO] nomad: blocked evals status modified: paused=false
Jan 08 11:42:25 ip-172-31-206-75 nomad[5617]: 2024-01-08T11:42:25.817Z [INFO] nomad.keyring: initialized keyring: id=56c026c8-0f96-fb71-5dca-20961686da10
Note: the nomad and consul components are installed by cloud-init, which may take an extra 30 seconds or so after the terraform config has been applied.
- Whilst still SSH'd into one of the nomad nodes, bootstrap the nomad ACL system:
$ nomad acl bootstrap
Accessor ID = 29604ac7-da5c-4b4c-50e6-8d6d78856ba2
Secret ID = b0c12a19-552g-c073-56c1-d438aafb37ag
Name = Bootstrap Token
Type = management
Global = true
Create Time = 2024-01-08 11:44:38.673696794 +0000 UTC
Expiry Time = <none>
Create Index = 19
Modify Index = 19
Policies = n/a
Roles = n/a
- Assign the Secret ID from the output of the last command to the NOMAD_TOKEN environment variable:
$ export NOMAD_TOKEN=<secret id obtained from nomad acl bootstrap output>
- Check that all three nomad cluster server nodes are in a healthy state:
$ nomad server status
Name Address Port Status Leader Raft Version Build Datacenter Region
ip-172-31-206-75.global 172.31.206.75 4648 alive true 3 1.7.2 dc1 global
ip-172-31-74-132.global 172.31.74.132 4648 alive false 3 1.7.2 dc1 global
ip-172-31-81-190.global 172.31.81.190 4648 alive false 3 1.7.2 dc1 global