
E2E: use a self-hosted Consul for easier WI testing #20256

Merged: tgross merged 3 commits into main from e2e-in-cluster-consul on Apr 2, 2024

Conversation

@tgross tgross (Member) commented Mar 29, 2024

Our consulcompat tests exercise both the Workload Identity and legacy Consul token workflows, but they are limited to running single-node tests. The E2E cluster is network-isolated, so using our HCP Consul cluster runs into a problem validating WI tokens: Consul can't reach the JWKS endpoint. In a real production environment you'd solve this with a real domain name whose CNAME points to a public IP fronting a proxy, but that's logistically impractical for our ephemeral nightly cluster.

Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can reach each other. This will allow us to update our Consul tests so they can use Workload Identity, in a separate PR.
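
As a rough illustration of the shape of that change, a minimal Terraform sketch (the resource names, AMI lookup, and template path are placeholders, not the actual e2e/terraform code):

resource "random_uuid" "consul_initial_management_token" {}

resource "tls_private_key" "consul_ca" {
  algorithm   = "ECDSA"
  ecdsa_curve = "P384"
}

resource "tls_self_signed_cert" "consul_ca" {
  private_key_pem       = tls_private_key.consul_ca.private_key_pem
  is_ca_certificate     = true
  validity_period_hours = 720
  allowed_uses          = ["cert_signing", "digital_signature", "key_encipherment"]

  subject {
    common_name = "e2e-consul-ca"
  }
}

resource "aws_instance" "consul_server" {
  ami                    = data.aws_ami.ubuntu.id      # hypothetical AMI lookup
  instance_type          = "t3a.medium"
  subnet_id              = aws_subnet.e2e.id           # same network as the Nomad nodes
  vpc_security_group_ids = [aws_security_group.e2e.id] # must allow Consul ports from all Nomad nodes

  # Render a server config with the TLS material and a pre-generated initial
  # management token so ACLs are usable as soon as the agent starts.
  user_data = templatefile("${path.module}/templates/consul-server.hcl.tpl", {
    ca_cert       = tls_self_signed_cert.consul_ca.cert_pem
    initial_token = random_uuid.consul_initial_management_token.result
  })
}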

An important note: right now this only runs Consul Enterprise, so the test runner needs to provide a license file. We only run nightly on Nomad Enterprise anyway, so this isn't a huge deal, but in later work it might be nice to be able to run against Consul CE as well.

Ref: #19698

@gulducat gulducat (Member) left a comment

LGTM!

Comment on lines +160 to +162
# We can't both bootstrap the ACLs and use the Consul TF provider's
# resource.consul_acl_token in the same Terraform run, because there's no way to
# get the management token into the provider's environment after we bootstrap,
Member

classic conundrum 🙃

it has been a while, but I swear there has been some TF work for multi-stage behavior over the last couple years. I dunno the details, or how much of that made its way into CE/CLI executions (may be cloud only), or if it's really worth chasing down for this when we have a perfectly okay-enough kludge to do the thing 😋

we can do separate TF runs though, since we're in a shelly CI environment, if we want to, someday.

@tgross tgross (Member, Author) Apr 1, 2024

Funny thing is we already do have separate TF runs, because we use TF to get the Vault token. So we could probably split this all out into:

  • TF for Vault token
  • TF to deploy the infra and upload all the config files
  • TF to run the bootstrap ACL scripts for Consul and Nomad

I'd definitely like to make this less miserable, but we're already doing the same thing for Nomad ACLs. So let's revisit that under a separate PR.

@tgross tgross (Member, Author)

(Alternately, we could lean on what our colleagues working on the "Nomad bench" project are doing and just pull in Ansible to do all the non-infra work, which would certainly make the TF easier!)

Contributor

Hmm... I haven't tested it yet, but it doesn't look like we use any consul_* resources before ACLs are bootstrapped, so I think we could use a data "http" block to wait until the Consul agent is fully bootstrapped, and have the "consul" provider depend on it, so that Consul resources are only created once the provider has been configured with the token.

A few "new" toys that could help here would be

Another non-Terraform option would be to pre-generate a bootstrap token and place it in the acl.initial_management config.

But probably not worth refactoring all of this 😅
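
For reference, that non-Terraform option is roughly the following in the Consul server agent config (the key is acl.tokens.initial_management in Consul 1.11+, formerly acl.tokens.master; the UUID is a placeholder):

acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true

  tokens {
    # Pre-generated management token; generate one per cluster and keep it secret.
    initial_management = "00000000-1111-2222-3333-444444444444"
  }
}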

@tgross tgross (Member, Author)

> so I think that we could use a data "http"

TIL about data.http. Could be handy for sure.

> Another non-Terraform option would be to pre-generate a bootstrap token and place it in the acl.initial_management config

We're doing that in this PR, but in my testing I still saw "ACLs haven't been bootstrapped" errors. Maybe I'm doing something wrong though.

Contributor

I just remembered an example where I used data.http to wait for Nomad:
https://gist.github.com/lgfa29/b707d56ace871602cb4955df2a1afad0#file-main-tf-L137-L144

Configuring the provider with the request output allows every other nomad_* resource to wait until everything is ready.
https://gist.github.com/lgfa29/b707d56ace871602cb4955df2a1afad0#file-main-tf-L5-L8
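
Adapted to this cluster, that pattern looks roughly like the sketch below (illustrative only: the aws_instance reference is a placeholder, TLS is ignored here, and the retry block needs hashicorp/http >= 3.2):

data "http" "nomad_ready" {
  # Poll the leader endpoint until the Nomad server answers.
  url = "http://${aws_instance.nomad_server.public_ip}:4646/v1/status/leader"

  retry {
    attempts     = 30
    min_delay_ms = 2000
  }
}

provider "nomad" {
  # Deriving the provider address from the data source output makes every
  # nomad_* resource implicitly wait until the poll above has succeeded.
  address = trimsuffix(data.http.nomad_ready.url, "/v1/status/leader")
}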

@tgross tgross (Member, Author)

Yeah, I think there's enough here that we could eliminate these two shell scripts. The data.http for Nomad might have to get a little creative, inasmuch as we can't give it a client TLS cert, but I think we probably want to eliminate client verification on the HTTP TLS listener anyway, so that the cluster is in line with our current security recommendations.

I've got a small pile of minor refactors I want to tackle once I've got my immediate yak shave done. I'll add this to that pile.
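
For context, dropping client verification on the HTTP API is a one-line change in the Nomad agent's tls stanza; a sketch with placeholder file paths:

tls {
  http = true
  rpc  = true

  ca_file   = "/etc/nomad.d/tls/ca.pem"
  cert_file = "/etc/nomad.d/tls/agent.pem"
  key_file  = "/etc/nomad.d/tls/agent-key.pem"

  # Keep mutual TLS on RPC, but let HTTPS clients (curl, data.http, etc.)
  # connect without presenting a client certificate.
  verify_https_client    = false
  verify_server_hostname = true
}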

e2e/terraform/consul-servers.tf (outdated comment, resolved)
e2e/terraform/etc/consul.d/servers.hcl (outdated comment, resolved)
e2e/terraform/outputs.tf (comment resolved)
file_permission = "0600"
}

resource "null_resource" "upload_consul_server_configs" {
Contributor

I know we use it in several other places, so we don't need to change things now, but null_resource is being replaced with terraform_data:
https://developer.hashicorp.com/terraform/language/resources/terraform-data#example-usage-null_resource-replacement
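
As a sketch of what that swap might look like for this resource (the connection details and the local_file/tls_private_key references are placeholders):

resource "terraform_data" "upload_consul_server_configs" {
  # Re-upload whenever the rendered server config changes.
  triggers_replace = [local_file.consul_server_config.content]

  connection {
    type        = "ssh"
    user        = "ubuntu"
    host        = aws_instance.consul_server.public_ip
    private_key = tls_private_key.ssh.private_key_pem
  }

  provisioner "file" {
    source      = local_file.consul_server_config.filename
    destination = "/tmp/servers.hcl"
  }
}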

@tgross tgross (Member, Author)

Yeah we have a ton of that sort of thing. Happy to take a pass in a separate PR to modernize some of our TF code though for sure.

@tgross tgross merged commit cf25cf5 into main Apr 2, 2024
8 checks passed
@tgross tgross deleted the e2e-in-cluster-consul branch April 2, 2024 19:24
@tgross tgross added the backport/1.7.x (backport to 1.7.x release line) label Apr 3, 2024
@tgross tgross removed the backport/1.7.x (backport to 1.7.x release line) label Apr 3, 2024
philrenaud pushed a commit that referenced this pull request Apr 18, 2024
Labels: theme/e2e, theme/testing
Projects: None yet
3 participants