[8.12] Upgrades being incorrectly rolled back because the agent cannot parse the upgrade marker #3947
Comments
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
I also observe something like this in agent status output, which I suspect but didn't confirm is probably a different form of the same problem:
Observed in
So far I have been unable to reproduce this locally, so I'm going to be adding logging, etc. to try and debug via the CI builds where this issue seems to reproduce pretty consistently. Here's my WIP PR for that: #3948. While I'm waiting on CI to run on that PR, I looked through the failing tests with this symptom for the past few runs, and at:
elastic-agent/internal/pkg/agent/application/upgrade/upgrade.go
Lines 225 to 256 in 7e80290
If the agent re-execs fast enough, that write might happen soon after the watcher is launched and is also trying to read the file.
Yes, my working theory is that the Watcher starts up, the Agent re-execs at the same time, and both try to write to the Upgrade Marker file for different reasons: the Watcher for writing

I don't think the race condition is coming from the old Agent writing the upgrade marker file at the same time as the Watcher. That's because the last write the old Agent makes happens before it even starts up the Watcher.
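(As a side note on that theory: a common way to make sure a concurrent reader never observes a half-written marker is to write the new contents to a temporary file and atomically rename it into place. The sketch below is only a generic illustration of that pattern, not how the agent's marker writer actually works; the `writeMarkerAtomically` name and the demo path are made up.)

```go
// Generic sketch (not the agent's real code): write the whole payload to a
// temp file in the same directory, then rename it over the marker so a
// concurrent reader sees either the old contents or the new ones, never a mix.
package main

import (
	"os"
	"path/filepath"
)

func writeMarkerAtomically(markerPath string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(markerPath), ".upgrade-marker-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best effort; harmless once the rename has happened

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush before renaming into place
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), markerPath)
}

func main() {
	// Hypothetical path and contents, purely for demonstration.
	_ = writeMarkerAtomically("upgrade-marker-demo.yaml", []byte("state: UPG_WATCHING\n"))
}
```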
So the "good" news is that I'm able to reproduce this scenario locally, on my Mac, using two local builds and Fleet spun up with
Okay, so I've been able to rule out the Fleet ack'ing theory I mentioned in #3947 (comment) and #3947 (comment). I've ruled it out in a couple of different ways. In both cases I was printing out the upgrade marker file every 250ms in a terminal window to see how its contents changed and when.
Interestingly there was a slight difference in the corruption I saw between both scenarios. In 1, the corrupted file looked like this:
In 2, the corrupted file looked like this:
This is making me think something simpler is wrong, perhaps with the YAML serialization or with writing the serialized bytes to disk. What doesn't make sense (yet) is why either of these problems might be happening with the upgrade details fields but not with the other fields in the upgrade marker. Investigating further...
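(For reference, the 250ms check mentioned above was just a quick polling loop over the marker file. A minimal Go equivalent is sketched below; the marker path is an assumption, so adjust it to your installation's data directory.)

```go
// Minimal sketch: poll the upgrade marker every 250ms and print it whenever
// its contents change, to see exactly when and how it gets modified.
package main

import (
	"bytes"
	"fmt"
	"os"
	"time"
)

func main() {
	// Assumed marker location; adjust to your installation.
	const markerPath = "/opt/Elastic/Agent/data/.update-marker"

	var last []byte
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()

	for range ticker.C {
		cur, err := os.ReadFile(markerPath)
		if err != nil {
			fmt.Println("read error:", err)
			continue
		}
		if !bytes.Equal(cur, last) {
			fmt.Printf("--- %s ---\n%s\n", time.Now().Format(time.RFC3339), cur)
			last = cur
		}
	}
}
```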
The 2nd corrupted file looks like the first time the data was written to disk.
That's a good observation, Lee! But how would that explain the first corrupted file, where the corruption is in the middle of the file?
Do we need to manually truncate the file before writing it? Or just replace the whole file? We don't just want to write, we want to completely replace the file contents.
Yup, lack of truncation might be the issue. We are not opening the file with
Still, I'd like to know exactly which commit / PR this bug was introduced in, to try and understand why it's happening. Going to continue bisecting for now, but will try the truncation fix at some point for sure.
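To illustrate the failure mode (just a sketch, not the agent's actual marker-writing code): if the file is opened for writing without `os.O_TRUNC` and the new payload is shorter than the old one, the tail of the previous contents survives past the new data, which produces exactly the kind of mixed YAML shown above. `os.WriteFile` opens with `O_TRUNC`, so it fully replaces the contents. The marker-like strings below are invented for demonstration only.

```go
// Sketch of the non-truncating write failure mode: a shorter second write
// leaves the tail of the previous, longer contents behind.
package main

import (
	"fmt"
	"os"
)

// writeNoTruncate mimics opening the file without O_TRUNC and writing from
// the start: bytes beyond the new payload are left untouched.
func writeNoTruncate(path string, data []byte) error {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY, 0o600) // note: no os.O_TRUNC
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(data)
	return err
}

func main() {
	const path = "marker-demo.yaml"
	defer os.Remove(path)

	// Hypothetical marker-like contents, purely for demonstration.
	_ = writeNoTruncate(path, []byte("action: {id: 123, version: 8.12.0}\ndetails: {state: UPG_WATCHING}\n"))
	_ = writeNoTruncate(path, []byte("action: {id: 123}\n")) // shorter second write

	got, _ := os.ReadFile(path)
	fmt.Printf("%q\n", got) // tail of the first write is still there -> corrupted YAML

	// os.WriteFile opens with O_TRUNC, so it fully replaces the contents instead.
	_ = os.WriteFile(path, []byte("action: {id: 123}\n"), 0o600)
}
```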
Yup, adding the
Now going to try running the |
Well, this test didn't pass, but that's to be expected because the non-truncating write in question is happening from the Upgrade Watcher, which is run as part of the new (i.e. upgraded) Agent. And the Agent that's upgraded to in the test is one pulled down from the artifacts API, not the one that's locally built. The PR for the fix is up: #3948. I've added a unit test to catch the truncation bug but, as explained above, the PR is not expected to pass the integration tests.
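A regression test along these lines catches the bug (a sketch only; the real fix and test live in the PR linked above, and `saveMarker` here is a stand-in for whatever the actual marker-writing helper is called):

```go
// Sketch of a regression test for the truncation bug: write a long marker,
// then a shorter one, and assert that nothing from the first write survives.
package upgrade

import (
	"os"
	"path/filepath"
	"testing"
)

// saveMarker stands in for the real marker-writing helper; the property under
// test is that it fully replaces the file contents.
func saveMarker(path string, data []byte) error {
	return os.WriteFile(path, data, 0o600)
}

func TestSaveMarkerTruncates(t *testing.T) {
	path := filepath.Join(t.TempDir(), "upgrade-marker.yaml")

	long := []byte("action: {id: 123, version: 8.12.0}\ndetails: {state: UPG_WATCHING}\n")
	short := []byte("action: {id: 123}\n")

	if err := saveMarker(path, long); err != nil {
		t.Fatal(err)
	}
	if err := saveMarker(path, short); err != nil {
		t.Fatal(err)
	}

	got, err := os.ReadFile(path)
	if err != nil {
		t.Fatal(err)
	}
	if string(got) != string(short) {
		t.Fatalf("marker not truncated: got %q, want %q", got, short)
	}
}
```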
Primarily observed on Linux. This is the error we are seeing:
How to reproduce
Run the Fleet managed upgrade test:
It consistently fails for me.
Some extra information
The process that fails is the new Elastic Agent; looking at the systemd logs, we can see:
There are two odd things there:
Looking at the upgrade marker file before the watcher rolls back the upgrade, we can see it is corrupted:
In all my tests I got exactly the same output/corruption.
The function that likely writes this version of the upgrade marker is
elastic-agent/internal/pkg/agent/application/upgrade/step_mark.go
Lines 200 to 216 in 7e80290
That is called by
elastic-agent/internal/pkg/agent/cmd/watch.go
Line 109 in 7e80290
The process that runs those functions is the Upgrade Watcher, which is the new version the Elastic Agent is upgrading to.