
Nomad 1.9.3: panic: runtime error: slice bounds out of range [:12] with capacity 0 #24441

Closed
HINT-SJ opened this issue Nov 12, 2024 · 7 comments · Fixed by #24442
Labels: hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/workload-identity, type/bug

Comments

HINT-SJ commented Nov 12, 2024

Nomad version

Nomad v1.9.3
BuildDate 2024-11-11T16:35:41Z
Revision d92bf1014886c0ff9f882f4a2691d5ae8ad8131c

Operating system and Environment details

Amazon Linux 2023 (minimal)
AWS EC2 Graviton (t4g.small)

Issue

While attempting to upgrade from Nomad 1.9.1 to 1.9.3 (skipping 1.9.2), the first server node we updated failed to start:

==> Nomad agent started! Log data will stream in below:
    2024-11-12T10:47:11.753Z [INFO]  nomad: setting up raft bolt store: no_freelist_sync=false
    2024-11-12T10:47:11.757Z [INFO]  nomad.raft: initial configuration: index=0 servers=[]
    2024-11-12T10:47:11.758Z [INFO]  nomad.raft: entering follower state: follower="Node at XXXXX.11:4647 [Follower]" leader-address= leader-id=
    2024-11-12T10:47:11.761Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-11.XXXXX.compute.internal.global XXXXX.11
    2024-11-12T10:47:11.761Z [INFO]  nomad: starting scheduling worker(s): num_workers=2 schedulers=["sysbatch", "service", "batch", "system", "_core"]
    2024-11-12T10:47:11.761Z [INFO]  nomad: started scheduling worker(s): num_workers=2 schedulers=["sysbatch", "service", "batch", "system", "_core"]
    2024-11-12T10:47:11.767Z [INFO]  nomad: adding server: server="ip-XXXXX-11.XXXXX.compute.internal.global (Addr: XXXXX.11:4647) (DC: dc1)"
    2024-11-12T10:47:11.774Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-115.XXXXX.compute.internal.global XXXXX.115
    2024-11-12T10:47:11.775Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-105.XXXXX.compute.internal.global XXXXX.105
    2024-11-12T10:47:11.775Z [INFO]  nomad: adding server: server="ip-XXXXX-115.XXXXX.compute.internal.global (Addr: XXXXX.115:4647) (DC: dc1)"
    2024-11-12T10:47:11.776Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-154.XXXXX.compute.internal.global XXXXX.154
    2024-11-12T10:47:11.776Z [INFO]  nomad: adding server: server="ip-XXXXX-105.XXXXX.compute.internal.global (Addr: XXXXX.105:4647) (DC: dc1)"
    2024-11-12T10:47:11.776Z [INFO]  nomad: serf: EventMemberJoin: ip-XXXXX-53.XXXXX.compute.internal.global XXXXX.53
    2024-11-12T10:47:11.779Z [INFO]  nomad: disabling bootstrap mode because existing Raft peers being reported by peer: peer_name=ip-XXXXX-154.XXXXX.compute.internal.global peer_address=10.169>
    2024-11-12T10:47:11.779Z [INFO]  nomad: adding server: server="ip-XXXXX-154.XXXXX.compute.internal.global (Addr: XXXXX.154:4647) (DC: dc1)"
    2024-11-12T10:47:11.779Z [INFO]  nomad: adding server: server="ip-XXXXX-53.XXXXX.compute.internal.global (Addr: XXXXX.53:4647) (DC: dc1)"
    2024-11-12T10:47:11.783Z [INFO]  nomad: successfully contacted Nomad servers: num_servers=4
    2024-11-12T10:47:11.784Z [WARN]  nomad.raft: failed to get previous log: previous-index=5202671 last-index=0 error="log not found"
    2024-11-12T10:47:11.788Z [INFO]  snapshot: creating new snapshot: path=/opt/nomad/data/server/raft/snapshots/11112-5202550-1731408431788.tmp
    2024-11-12T10:47:11.804Z [INFO]  nomad.raft: snapshot network transfer progress: read-bytes=3154303 percent-complete="100.00%"
    2024-11-12T10:47:11.818Z [INFO]  nomad.raft: copied to local snapshot: bytes=3154303
panic: runtime error: slice bounds out of range [:12] with capacity 0
goroutine 96 [running]:
github.com/hashicorp/go-kms-wrapping/v2/aead.(*Wrapper).Decrypt(0x4000a37d40, {0xd085bc?, 0x40009c55c0?}, 0x4000ec1810, {0x0?, 0x40009c55c0?, 0x400008fd78?})
        github.com/hashicorp/go-kms-wrapping/v2@v2.0.16/aead/aead.go:272 +0x1a0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask.func2()
        github.com/hashicorp/nomad/nomad/encrypter.go:481 +0x68
github.com/hashicorp/nomad/helper.WithBackoffFunc({0x3696740, 0x40007d8e60}, 0x3b9aca00, 0x12a05f200, 0x400008ff28)
        github.com/hashicorp/nomad/helper/backoff.go:50 +0xe0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask(0x40009401c0, {0x3696740, 0x40007d8e60}, 0x40009c5550, {0x369bf80, 0x4000a37d40}, 0x0?, 0x40007d8ff0, 0x40010ac780)
        github.com/hashicorp/nomad/nomad/encrypter.go:474 +0x11c
created by github.com/hashicorp/nomad/nomad.(*Encrypter).AddWrappedKey in goroutine 102
        github.com/hashicorp/nomad/nomad/encrypter.go:426 +0x480
nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
nomad.service: Failed with result 'exit-code'.
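
For context, the panic corresponds to slicing an empty byte slice by a fixed AES-GCM nonce length. A minimal Go sketch (not the actual go-kms-wrapping code) that reproduces the same class of panic:

```go
// Minimal sketch, assuming the wrapped-key payload arrives empty; the slice
// expression below panics exactly like the trace above.
package main

func main() {
	const nonceSize = 12 // standard AES-GCM nonce length

	var ciphertext []byte // stands in for an empty/unreadable wrapped-key payload

	// panic: runtime error: slice bounds out of range [:12] with capacity 0
	_ = ciphertext[:nonceSize]
}
```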

Possibly related to issues #24379 and #24411.

We actually roll out new EC2 instances (so no old data is left on the system), if that helps.

Reproduction steps

Update a 1.9.1 cluster to 1.9.3 ^^
The cluster has been up for several years and across several major versions.

Expected Result

Successfully starting a new server node with the new version :)

HINT-SJ (Author) commented Nov 12, 2024

FYI, rolling the node back to 1.9.1 yields the exact same panic as above:

panic: runtime error: slice bounds out of range [:12] with capacity 0
goroutine 50 [running]:
github.com/hashicorp/go-kms-wrapping/v2/aead.(*Wrapper).Decrypt(0x4000f06000, {0xd085bc?, 0x4000dfa070?}, 0x4000ab8370, {0x0?, 0x4000dfa070?, 0x4000abed78?})
        github.com/hashicorp/go-kms-wrapping/v2@v2.0.16/aead/aead.go:272 +0x1a0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask.func2()
        github.com/hashicorp/nomad/nomad/encrypter.go:481 +0x68
github.com/hashicorp/nomad/helper.WithBackoffFunc({0x36835c0, 0x4000f02000}, 0x3b9aca00, 0x12a05f200, 0x4000abef28)
        github.com/hashicorp/nomad/helper/backoff.go:50 +0xe0
github.com/hashicorp/nomad/nomad.(*Encrypter).decryptWrappedKeyTask(0x400095d810, {0x36835c0, 0x4000f02000}, 0x4000dfa040, {0x3688d80, 0x4000f06000}, 0x400095d810?, 0x4000f02050, 0x4000a584b0)
        github.com/hashicorp/nomad/nomad/encrypter.go:474 +0x11c
created by github.com/hashicorp/nomad/nomad.(*Encrypter).AddWrappedKey in goroutine 35
        github.com/hashicorp/nomad/nomad/encrypter.go:426 +0x480
nomad.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

jrasell (Member) commented Nov 12, 2024

Hi @HINT-SJ, thanks for raising this issue, and sorry you've hit yet another class of this bug. I have already raised a linked PR to fix it, along with additional spot-checks to ensure this pattern doesn't appear elsewhere. I'll work with the rest of the team to get this merged and will look to add some additional tests in this area in the future.
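
(For illustration only; this sketch is not the actual change in #24442.) The general defensive pattern is to validate the payload length before slicing out the nonce and to return an error instead of panicking:

```go
// Hedged sketch of the guard pattern; splitNonce is a made-up helper name,
// not a Nomad or go-kms-wrapping function.
package main

import (
	"errors"
	"fmt"
)

const nonceSize = 12 // standard AES-GCM nonce length

func splitNonce(ciphertext []byte) (nonce, rest []byte, err error) {
	if len(ciphertext) < nonceSize {
		// Surface a recoverable error rather than letting the slice expression panic.
		return nil, nil, errors.New("wrapped key ciphertext is empty or truncated")
	}
	return ciphertext[:nonceSize], ciphertext[nonceSize:], nil
}

func main() {
	if _, _, err := splitNonce(nil); err != nil {
		fmt.Println("decrypt failed:", err)
	}
}
```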

jrasell added the stage/accepted, theme/workload-identity, and hcc/jira labels on Nov 12, 2024
jrasell moved this from Needs Triage to In Progress in Nomad - Community Issues Triage on Nov 12, 2024
HINT-SJ (Author) commented Nov 12, 2024

Thanks for your continued work :)
I felt totally stupid when I encountered the "similar" panic again and had to double-check the version ^^

Fingers crossed!

jrasell (Member) commented Nov 12, 2024

@HINT-SJ my pleasure. I also had to do a double take on this :D

bfqrst commented Nov 14, 2024

Guys, is there any timeline on when this fix will be shipped? Our clusters are running thin, to the point where some can't elect a leader anymore. Since we can't add new servers, we're caught between a rock and a hard place! I am not a paying customer and I'm fully aware that I'm using an open-core product! That said, can somebody share a roadmap on this? Thanks

blalor (Contributor) commented Nov 18, 2024

@bfqrst there are a couple of workarounds (building from source, downloading an artifact from CI) discussed on #24442.

bfqrst commented Nov 18, 2024

Thanks @blalor, we did end up doing exactly that... Plus some lessons learned along the way.
