E2E: use a self-hosted Consul for easier WI testing #20256
Conversation
Force-pushed from 2b1b0a9 to ab012cd
Force-pushed from ab012cd to 45ba631
Force-pushed from 45ba631 to ba2cfae
Force-pushed from ba2cfae to 7b8c955
Force-pushed from 7b8c955 to bf531cf
LGTM!
# We can't both bootstrap the ACLs and use the Consul TF provider's
# resource.consul_acl_token in the same Terraform run, because there's no way to
# get the management token into the provider's environment after we bootstrap,
classic conundrum 🙃
It has been a while, but I swear there has been some TF work on multi-stage behavior over the last couple of years. I don't know the details, how much of it made its way into CE/CLI executions (it may be cloud only), or whether it's really worth chasing down for this when we have a perfectly okay-enough kludge to do the thing 😋

We can do separate TF runs, though, since we're in a shelly CI environment, if we want to, someday.
Funny thing is we already do have separate TF runs, because we use TF to get the Vault token. So we could probably split this all out into:
- TF for Vault token
- TF to deploy the infra and upload all the config files
- TF to run the bootstrap ACL scripts for Consul and Nomad
I'd definitely like to make this less miserable, but we're already doing the same thing for Nomad ACLs. So let's revisit that under a separate PR.
(Alternately, we could lean on what our colleagues working on the "Nomad bench" project are doing and just pull in Ansible to do all the non-infra work, which would certainly make the TF easier!)
Hmm... I haven't tested it yet, but it doesn't seem like we use `consul_*` resources before ACLs are bootstrapped, so I think we could use a `data "http"` to wait until the Consul agent is fully bootstrapped and have the `consul` provider depend on it, so that Consul resources are only created once the provider has been configured with the token (rough sketch after the links below).
A few "new" toys that could help here would be
- https://registry.terraform.io/providers/hashicorp/external/latest/docs/data-sources/external
- https://registry.terraform.io/providers/hashicorp/http/latest/docs/data-sources/http
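A minimal sketch of that idea, assuming the http data source from the second link; the address, port, variable names, and retry settings are placeholders rather than anything from this repo:

```hcl
locals {
  consul_addr = "${var.consul_server_ip}:8501" # placeholder HTTPS address
}

# Poll Consul's status endpoint until the agent answers, retrying on
# connection errors while the server is still coming up.
data "http" "consul_ready" {
  url      = "https://${local.consul_addr}/v1/status/leader"
  insecure = true # self-signed cluster CA; could pass ca_cert_pem instead

  retry {
    attempts     = 10
    min_delay_ms = 3000
  }
}

provider "consul" {
  scheme = "https"
  # Deriving the address from the data source output (instead of the local)
  # creates the implicit dependency, so every consul_* resource waits for the
  # health check before it is created.
  address = trimsuffix(trimprefix(data.http.consul_ready.url, "https://"), "/v1/status/leader")
  token   = var.consul_initial_management_token
}
```

One caveat: newer versions of the http data source don't fail on non-2xx responses by default, so a postcondition on status_code may also be needed.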
Another non-Terraform option would be to pre-generate a bootstrap token and place it in the acl.initial_management
config.
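For reference, a pre-generated token would land in the agent configuration roughly like this (the key lives under acl.tokens in current Consul releases; the UUID and the other ACL settings are placeholders):

```hcl
# Consul agent config sketch: seed the ACL system with a known management
# token instead of calling the bootstrap API after startup.
acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true

  tokens {
    initial_management = "00000000-0000-0000-0000-000000000000" # placeholder
  }
}
```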
But probably not worth refactoring all of this 😅
> so I think that we could use a data "http"

TIL about data.http. Could be handy for sure.

> Another non-Terraform option would be to pre-generate a bootstrap token and place it in the acl.initial_management config.

We're doing that in this PR, but in my testing I still saw "ACLs haven't been bootstrapped" errors. Maybe I'm doing something wrong though.
I just remembered an example where I used data.http to wait for Nomad:
https://gist.github.com/lgfa29/b707d56ace871602cb4955df2a1afad0#file-main-tf-L137-L144

Configuring the provider with the request output allows every other nomad_* resource to wait until everything is ready:
https://gist.github.com/lgfa29/b707d56ace871602cb4955df2a1afad0#file-main-tf-L5-L8
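Condensed, the pattern looks roughly like this (not the gist's exact code; the address and retry values are made up):

```hcl
locals {
  nomad_addr = "http://${var.nomad_server_ip}:4646" # placeholder address
}

data "http" "nomad_ready" {
  url = "${local.nomad_addr}/v1/status/leader"

  retry {
    attempts     = 20
    min_delay_ms = 2000
  }
}

provider "nomad" {
  # Stripping the health-check path back off yields the base address, and
  # referencing the data source here is what makes every nomad_* resource
  # wait until the API answers.
  address = trimsuffix(data.http.nomad_ready.url, "/v1/status/leader")
}
```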
Yeah, I think there's enough here that we could eliminate these two shell scripts. The data.http for Nomad might have to get a little creative, inasmuch as we can't give it a client TLS cert, but I think we probably want to eliminate client verification on HTTP TLS anyway so that the cluster is in line with our current security recommendations.
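For context, that change is just flipping the client-verification knob in the Nomad agent's tls block; the file paths below are illustrative:

```hcl
# Nomad agent TLS config sketch: keep TLS on the HTTP API but stop requiring
# client certificates, so health checks like data.http can reach it.
tls {
  http = true
  rpc  = true

  ca_file   = "/etc/nomad.d/tls/ca.crt"   # placeholder paths
  cert_file = "/etc/nomad.d/tls/agent.crt"
  key_file  = "/etc/nomad.d/tls/agent.key"

  verify_server_hostname = true
  verify_https_client    = false
}
```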
I've got a small pile of minor refactors I want to tackle once I've got my immediate yak shave done. I'll add this to that pile.
Force-pushed from 4688788 to 4da82cd
file_permission = "0600" | ||
} | ||
|
||
resource "null_resource" "upload_consul_server_configs" { |
I know we use it in several other places, so we don't need to change things now, but null_resource is being replaced with terraform_data:
https://developer.hashicorp.com/terraform/language/resources/terraform-data#example-usage-null_resource-replacement
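As a rough illustration of the swap (the connection details, file names, and resource references below are placeholders, not this repo's actual code):

```hcl
# Hypothetical terraform_data version of a null_resource-style uploader:
# provisioners and connection blocks attach the same way, and
# triggers_replace takes over for the old triggers argument.
resource "terraform_data" "upload_consul_server_configs" {
  triggers_replace = [
    local_sensitive_file.consul_server_config.content,
  ]

  connection {
    type        = "ssh"
    user        = "ubuntu"
    host        = aws_instance.consul_server.public_ip
    private_key = tls_private_key.ssh.private_key_pem
  }

  provisioner "file" {
    source      = local_sensitive_file.consul_server_config.filename
    destination = "/tmp/consul-server.hcl"
  }
}
```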
Yeah we have a ton of that sort of thing. Happy to take a pass in a separate PR to modernize some of our TF code though for sure.
Force-pushed from 4da82cd to f1cd5af
Our `consulcompat` tests exercise both the Workload Identity and legacy Consul token workflows, but they are limited to running single-node tests. The E2E cluster is network isolated, so using our HCP Consul cluster runs into a problem validating WI tokens because it can't reach the JWKS endpoint. In real production environments you'd solve this with a CNAME pointing to a public IP pointing to a proxy with a real domain name, but that's logistically impractical for our ephemeral nightly cluster.

Migrate the HCP Consul to a single-node Consul cluster on AWS EC2 alongside our Nomad cluster. Bootstrap TLS and ACLs in Terraform and ensure all nodes can reach each other. This will allow us to update our Consul tests so they can use Workload Identity, in a separate PR.

An important note here is that right now this only runs Consul Enterprise, so the test runner needs to provide a license file. We only run nightly on Nomad Enterprise anyway, so this isn't a huge deal, but in later work it might be nice to be able to run against Consul CE as well.

Ref: #19698