E2E: use a self-hosted Consul for easier WI testing #20256
Conversation
Force-pushed from 2b1b0a9 to ab012cd
Force-pushed from ab012cd to 45ba631
Force-pushed from 45ba631 to ba2cfae
Force-pushed from ba2cfae to 7b8c955
Force-pushed from 7b8c955 to bf531cf
LGTM!
# We can't both bootstrap the ACLs and use the Consul TF provider's
# resource.consul_acl_token in the same Terraform run, because there's no way to
# get the management token into the provider's environment after we bootstrap,
classic conundrum 🙃
It has been a while, but I swear there has been some TF work on multi-stage behavior over the last couple of years. I don't know the details, how much of it made its way into CE/CLI executions (it may be cloud only), or whether it's really worth chasing down for this when we have a perfectly okay-enough kludge to do the thing 😋

We can do separate TF runs, though, since we're in a shelly CI environment, if we want to, someday.
Funny thing is we already do have separate TF runs, because we use TF to get the Vault token. So we could probably split this all out into:
- TF for Vault token
- TF to deploy the infra and upload all the config files
- TF to run the bootstrap ACL scripts for Consul and Nomad
I'd definitely like to make this less miserable, but we're already doing the same thing for Nomad ACLs. So let's revisit that under a separate PR.
(Alternately, we could lean on what our colleagues working on the "Nomad bench" project are doing and just pull in Ansible to do all the non-infra work, which would certainly make the TF easier!)
Hmm... I haven't tested it yet, but it doesn't seem like we use `consul_*` resources before ACLs are bootstrapped, so I think we could use a `data "http"` to wait until the Consul agent is fully bootstrapped and have the `consul` provider depend on it, so that Consul resources are only created once the provider has been configured with the token (rough sketch after the links below).
A few "new" toys that could help here would be
- https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/external
- https://registry.terraform.io/providers/hashicorp/http/latest/docs/data-sources/http
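A minimal sketch of that idea, assuming the http data source from the second link; the address, port, variable names, and retry settings are placeholders rather than anything from this repo:

```hcl
locals {
  consul_addr = "${var.consul_server_ip}:8501" # placeholder HTTPS address
}

# Poll Consul's status endpoint until the agent answers, retrying on
# connection errors while the server is still coming up.
data "http" "consul_ready" {
  url      = "https://${local.consul_addr}/v1/status/leader"
  insecure = true # self-signed cluster CA; could pass ca_cert_pem instead

  retry {
    attempts     = 10
    min_delay_ms = 3000
  }
}

provider "consul" {
  scheme = "https"
  # Deriving the address from the data source output (instead of the local)
  # creates the implicit dependency, so every consul_* resource waits for the
  # health check before it is created.
  address = trimsuffix(trimprefix(data.http.consul_ready.url, "https://"), "/v1/status/leader")
  token   = var.consul_initial_management_token
}
```

One caveat: newer versions of the http data source don't fail on non-2xx responses by default, so a postcondition on status_code may also be needed.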
Another non-Terraform option would be to pre-generate a bootstrap token and place it in the acl.initial_management
config.
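For reference, a pre-generated token would land in the agent configuration roughly like this (the key lives under acl.tokens in current Consul releases; the UUID and the other ACL settings are placeholders):

```hcl
# Consul agent config sketch: seed the ACL system with a known management
# token instead of calling the bootstrap API after startup.
acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true

  tokens {
    initial_management = "00000000-0000-0000-0000-000000000000" # placeholder
  }
}
```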
But probably not worth refactoring all of this 😅
> so I think that we could use a data "http"

TIL about data.http. Could be handy for sure.

> Another non-Terraform option would be to pre-generate a bootstrap token and place it in the acl.initial_management config.

We're doing that in this PR, but in my testing I still saw "ACLs haven't been bootstrapped" errors. Maybe I'm doing something wrong though.
I just remembered an example where I used data.http to wait for Nomad:
https://gist.github.com/lgfa29/b707d56ace871602cb4955df2a1afad0#file-main-tf-L137-L144

Configuring the provider with the request output allows every other nomad_* resource to wait until everything is ready:
https://gist.github.com/lgfa29/b707d56ace871602cb4955df2a1afad0#file-main-tf-L5-L8
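Condensed, the pattern looks roughly like this (not the gist's exact code; the address and retry values are made up):

```hcl
locals {
  nomad_addr = "http://${var.nomad_server_ip}:4646" # placeholder address
}

data "http" "nomad_ready" {
  url = "${local.nomad_addr}/v1/status/leader"

  retry {
    attempts     = 20
    min_delay_ms = 2000
  }
}

provider "nomad" {
  # Stripping the health-check path back off yields the base address, and
  # referencing the data source here is what makes every nomad_* resource
  # wait until the API answers.
  address = trimsuffix(data.http.nomad_ready.url, "/v1/status/leader")
}
```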
Yeah, I think there's enough here that we could eliminate these two shell scripts. The data.http for Nomad might have to get a little creative, inasmuch as we can't give it a client TLS cert, but I think we probably want to eliminate client verification on HTTP TLS anyway so that the cluster is in line with our current security recommendations.
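For context, that change is just flipping the client-verification knob in the Nomad agent's tls block; the file paths below are illustrative:

```hcl
# Nomad agent TLS config sketch: keep TLS on the HTTP API but stop requiring
# client certificates, so health checks like data.http can reach it.
tls {
  http = true
  rpc  = true

  ca_file   = "/etc/nomad.d/tls/ca.crt"   # placeholder paths
  cert_file = "/etc/nomad.d/tls/agent.crt"
  key_file  = "/etc/nomad.d/tls/agent.key"

  verify_server_hostname = true
  verify_https_client    = false
}
```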
I've got a small pile of minor refactors I want to tackle once I've got my immediate yak shave done. I'll add this to that pile.
Force-pushed from 4688788 to 4da82cd
file_permission = "0600" | ||
} | ||
|
||
resource "null_resource" "upload_consul_server_configs" { |
I know we use it in several other places, so we don't need to change things now, but null_resource is being replaced with terraform_data:
https://developer.hashicorp.com/terraform/language/resources/terraform-data#example-usage-null_resource-replacement
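As a rough illustration of the swap (the connection details, file names, and resource references below are placeholders, not this repo's actual code):

```hcl
# Hypothetical terraform_data version of a null_resource-style uploader:
# provisioners and connection blocks attach the same way, and
# triggers_replace takes over for the old triggers argument.
resource "terraform_data" "upload_consul_server_configs" {
  triggers_replace = [
    local_sensitive_file.consul_server_config.content,
  ]

  connection {
    type        = "ssh"
    user        = "ubuntu"
    host        = aws_instance.consul_server.public_ip
    private_key = tls_private_key.ssh.private_key_pem
  }

  provisioner "file" {
    source      = local_sensitive_file.consul_server_config.filename
    destination = "/tmp/consul-server.hcl"
  }
}
```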
Yeah we have a ton of that sort of thing. Happy to take a pass in a separate PR to modernize some of our TF code though for sure.
Force-pushed from 4da82cd to f1cd5af
Our `consulcompat` tests exercise both the Workload Identity and legacy Consul token workflows, but they are limited to running single-node tests. The E2E cluster is network isolated, so using our HCP Consul cluster runs into a problem validating WI tokens because it can't reach the JWKS endpoint. In real production environments you'd solve this with a CNAME pointing to a public IP pointing to a proxy with a real domain name, but that's logistically impractical for our ephemeral nightly cluster.

Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can reach each other. This will allow us to update our Consul tests so they can use Workload Identity, in a separate PR.

An important note here is that right now this only runs Consul Enterprise, so the test runner needs to provide a license file. We only run nightly on Nomad Enterprise anyway, so this isn't a huge deal, but in later work it might be nice to be able to run against Consul CE as well.

Ref: #19698