
[8.12] Upgrades being incorrectly rolled back because the agent cannot parse the upgrade marker #3947

Closed
cmacknz opened this issue Dec 21, 2023 · 15 comments · Fixed by #3948
Labels: bug (Something isn't working) · Team:Elastic-Agent (Label for the Agent team) · v8.12.0

Comments

@cmacknz
Member

cmacknz commented Dec 21, 2023

Primarily observed on Linux. This is the error we are seeing:

Error: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context

How to reproduce

Run the Fleet managed upgrade test:

AGENT_KEEP_INSTALLED=true SNAPSHOT=true TEST_PLATFORMS="linux/amd64" mage -v integration:single TestFleetManagedUpgrade

It consistently fails for me.

Some extra information

The process that fails is the new Elastic Agent; looking at the systemd logs we can see:

Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Error: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Error: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Usage:
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   elastic-agent [subcommand] [flags]
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   elastic-agent [command]
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Available Commands:
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   completion  Generate the autocompletion script for the specified shell
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   component   Tools to work on components
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   diagnostics Gather diagnostics information from the Elastic Agent and write it to a zip archive
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   enroll      Enroll the Elastic Agent into Fleet
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   help        Help about any command
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   inspect     Show current configuration of the Elastic Agent
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   install     Install Elastic Agent permanently on this system
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   logs        Output Elastic Agent logs
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   restart     Restart the currently running Elastic Agent daemon
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   run         Start the Elastic Agent
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   status      Show the current status of the running Elastic Agent daemon
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   uninstall   Uninstall Elastic Agent from this system
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   upgrade     Upgrade the currently installed Elastic Agent to the specified version
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   version     Display the version of the elastic-agent.
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   watch       Watch the Elastic Agent for failures and initiate rollback
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Flags:
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -c, --c string                     Configuration file, relative to path.config (default "elastic-agent.yml")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --config string                Configuration file, relative to path.config (default "elastic-agent.yml")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -d, --d string                     Enable certain debug selectors
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -e, --e                            Log to stderr and disable syslog/file output
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --environment environmentVar   set environment being ran in (default default)
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -h, --help                         help for elastic-agent
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.config string           Config path is the directory Agent looks for its config file (default "/opt/Elastic/Agent")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.downloads string        Downloads path contains binaries Agent downloads (default "/opt/Elastic/Agent/data/elastic-agent-53dce0/downloads")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.home string             Agent root path (default "/opt/Elastic/Agent")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.install string          DEPRECATED, setting this flag has no effect since v8.6.0
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.logs string             Logs path contains Agent log output (default "/opt/Elastic/Agent")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -v, --v                            Log at INFO level
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Use "elastic-agent [command] --help" for more information about a command.
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Unit process 65049 (elastic-agent) remains running after unit stopped.
Dec 21 15:15:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Scheduled restart job, restart counter is at 2.
Dec 21 15:15:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..

There are two odd things there:

  1. The error shown at the top of this issue
  2. The help output, as if the Elastic Agent had been run with a wrong CLI argument.

Looking at the upgrade marker file before the watcher rolls back the upgrade, we can see it is corrupted:

hash: 7e8029
updated_on: 2023-12-21T14:14:20.80794607Z
prev_version: 8.13.0
prev_hash: 53dce0
acked: false
action:
  id: fd49bdb0-d30c-48ff-b411-5d98ea3525d7
  type: UPGRADE
  version: 8.13.0-SNAPSHOT
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  action_id: fd49bdb0-d30c-48ff-b411-5d98ea3525d7
  metadata:
    retry_until: null
_percent: 1
    download_rate: .inf
    retry_until: null

In all my tests I got exactly the same output and the same corruption.

The function that likely writes this version of the upgrade marker is

func SaveMarker(marker *UpdateMarker, shouldFsync bool) error {
    makerSerializer := &updateMarkerSerializer{
        Hash:        marker.Hash,
        UpdatedOn:   marker.UpdatedOn,
        PrevVersion: marker.PrevVersion,
        PrevHash:    marker.PrevHash,
        Acked:       marker.Acked,
        Action:      convertToMarkerAction(marker.Action),
        Details:     marker.Details,
    }
    markerBytes, err := yaml.Marshal(makerSerializer)
    if err != nil {
        return err
    }
    return writeMarkerFile(markerFilePath(), markerBytes, shouldFsync)
}

That is called by

upgradeDetails := initUpgradeDetails(marker, upgrade.SaveMarker, log)

The process that runs those functions is the upgrade watcher, which runs as part of the new version of the Elastic Agent, i.e. the version being upgraded to.

@cmacknz cmacknz added bug Something isn't working Team:Elastic-Agent Label for the Agent team v8.12.0 labels Dec 21, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator changed the title [8.12] Upgrades being unconditionally rolledback because the agent cannot parse the upgrade market [8.12] Upgrades being unconditionally rolledback because the agent cannot parse the upgrade marker Dec 21, 2023
@cmacknz cmacknz changed the title [8.12] Upgrades being unconditionally rolledback because the agent cannot parse the upgrade marker [8.12] Upgrades being incorrectly rolled back because the agent cannot parse the upgrade marker Dec 21, 2023
@cmacknz
Member Author

cmacknz commented Dec 21, 2023

I also observed something like this in the agent status output, which I suspect (but didn't confirm) is a different form of the same problem:

    upgrade_fleet_test.go:248: Waiting for upgrade watcher to start...
    upgrade_fleet_test.go:251: Upgrade watcher started
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrader.go:434: waiting for healthy agent and proper version: could not unmarshal agent status output: error: exit status 1, output: Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused"
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
        
        invalid character 'E' looking for beginning of value
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrader.go:434: waiting for healthy agent and proper version: could not unmarshal agent status output: error: exit status 1, output: Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused"
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html

Observed in TestFleetAirGappedUpgrade in https://buildkite.com/elastic/elastic-agent/builds/5716

@ycombinator
Contributor

So far I have been unable to reproduce this locally. So I'm going to be adding logging, etc. to try and debug via CI builds, where this issue seems to reproduce pretty consistently. Here's my WIP PR for that: #3948.

While I'm waiting on CI to run on that PR, I looked through the failing tests with this symptom for the past few runs on main. The one pattern I'm seeing is that this issue seems to be showing up only in Fleet-managed upgrades; there are some non-Fleet-managed upgrade tests failing too but they are not showing the same symptom. This makes me suspect the race might involve the Ack'ing code, which is only exercised in Fleet-managed mode, and which does write to the Upgrade Marker file.

// Ack acks last upgrade action
func (u *Upgrader) Ack(ctx context.Context, acker acker.Acker) error {
    // get upgrade action
    marker, err := LoadMarker()
    if err != nil {
        return err
    }
    if marker == nil {
        return nil
    }
    if marker.Acked {
        return nil
    }

    // Action can be nil if the upgrade was called locally.
    // Should handle gracefully
    // https://github.com/elastic/elastic-agent/issues/1788
    if marker.Action != nil {
        if err := acker.Ack(ctx, marker.Action); err != nil {
            return err
        }
        if err := acker.Commit(ctx); err != nil {
            return err
        }
    }

    marker.Acked = true

    return SaveMarker(marker, false)
}

@cmacknz
Member Author

cmacknz commented Dec 21, 2023

If the agent re-execs fast enough, that write might happen soon after the watcher is launched, while the watcher is trying to read the file.

@ycombinator
Contributor

ycombinator commented Dec 21, 2023

If the agent re-execs fast enough, that write might happen soon after the watcher is launched, while the watcher is trying to read the file.

Yes, my working theory is that the Watcher starts up, the Agent re-execs at the same time, and both try to write to the Upgrade Marker file for different reasons: the Watcher to write UPG_WATCHING and the new Agent to acknowledge the upgrade (this happens very early on in the Coordinator's Run method).

I don't think the race condition is coming from the old Agent writing the upgrade marker file at the same time as the Watcher. That's because the last write the old Agent makes is before it even starts up the Watcher.

@ycombinator
Contributor

So the "good" news is that I'm able to reproduce this scenario locally, on my Mac, using two local builds and Fleet spun up with elastic-package. Hopefully now I can iterate pretty quickly on a potential fix.

@ycombinator
Contributor

ycombinator commented Dec 21, 2023

Okay, so I've been able to rule out the Fleet ack'ing theory I mentioned in #3947 (comment) and #3947 (comment). I've ruled it out in a couple of different ways. In both cases I was printing out the upgrade marker file every 250ms in a terminal window to see how its contents changed and when.

  1. I used builds of Agent where I commented out the code where the Ack function was writing the upgrade marker to disk. Despite doing this, I saw the same symptoms of corruption in the upgrade marker file right after the upgrade watcher started up.
  2. I upgraded a standalone agent using --source-uri=file://... and saw the same symptoms of corruption in the upgrade marker file right after the upgrade watcher started up.

Interestingly, there was a slight difference in the corruption I saw between the two scenarios.

In 1, the corrupted file looked like this:

hash: "793184"
updated_on: 2023-12-21T12:12:45.029749-08:00
prev_version: 8.11.0
prev_hash: f8671d
acked: false
action:
  id: e519c3ac-d47c-4f50-be7f-92853037db42
  type: UPGRADE
  version: 8.13.0-SNAPSHOT
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  action_id: e519c3ac-d47c-4f50-be7f-92853037db42
  metadata:
    retry_until: null
_percent: 1
    download_rate: .inf
    retry_until: null

In 2, the corrupted file looked like this:

Thu Dec 21 20:16:17 UTC 2023
hash: "793184"
updated_on: 2023-12-21T12:16:16.51586-08:00
prev_version: 8.11.0
prev_hash: f8671d
acked: false
action: null
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  metadata:
    retry_until: null
ntil: null

This is making me think something simpler is wrong, perhaps with the YAML serialization or with writing the serialized bytes to disk. What doesn't make sense (yet) is why either of these problems would affect the upgrade details fields but not the other fields in the upgrade marker. Investigating further...

@leehinman
Contributor

This is making me think something simpler is wrong, perhaps with the YAML serialization or with writing the serialized bytes to disk. What doesn't make sense (yet) is why either of these problems would affect the upgrade details fields but not the other fields in the upgrade marker. Investigating further...

It looks like, for the 2nd corrupted file, the first time the data was written to disk the metadata had another line that was 10 characters long, and then we did a write over the top of the whole file that was shorter in length. That would explain why the corruption only happens at the end of the file, and why "ntil: null" and " retry_until: null" overlap. But to do that we would either have to not update EOF, or write the contents of a buffer twice where we used the old length.
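To make that mechanism concrete, here is a minimal, self-contained sketch (hypothetical scratch file and contents, not the agent's actual marker code) showing how an os.O_WRONLY|os.O_CREATE open without os.O_TRUNC leaves the tail of a previous, longer write in place:

package main

import (
    "fmt"
    "os"
)

// writeNoTrunc mimics the marker write path: open for writing and create
// the file if missing, but do NOT truncate any existing contents.
func writeNoTrunc(path string, data []byte) error {
    f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE, 0600)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = f.Write(data) // writes start at offset 0, not at EOF
    return err
}

func main() {
    const path = "marker-demo.yml" // hypothetical scratch file

    long := []byte("details:\n  metadata:\n    retry_until: null\n")
    short := []byte("details:\n  metadata: {}\n")

    _ = writeNoTrunc(path, long)  // first, longer write (errors ignored for brevity)
    _ = writeNoTrunc(path, short) // second, shorter write only overwrites the prefix

    got, _ := os.ReadFile(path)
    fmt.Printf("%q\n", got) // the tail of the first write is still there after the shorter contents
}

The second write replaces only the first len(short) bytes; everything after that is left over from the first write, which matches the "ntil: null" / "retry_until: null" overlap seen above.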

@ycombinator
Contributor

That's a good observation, Lee! But how would that explain the first corrupted file, where the corruption is in the middle of the file?

@ycombinator
Contributor

ycombinator commented Dec 21, 2023

So, some good news. I found that this bug does not exist as of commits d80cadd, 8082519, 76ce7f1, and cc81e4b. Bisecting to figure out where it was introduced after that point.

@cmacknz
Member Author

cmacknz commented Dec 21, 2023

It looks like, for the 2nd corrupted file, the first time the data was written to disk the metadata had another line that was 10 characters long, and then we did a write over the top of the whole file that was shorter in length.

Do we need to manually truncate the file before writing it? Or just replace the whole file?

We don't just want to write, we want to completely replace the file contents.
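For reference, one way to get "completely replace the contents" semantics regardless of the new size is to write a sibling temp file and atomically rename it over the marker. This is only a sketch of that alternative (hypothetical replaceFile helper and package name), not what #3948 necessarily implements:

package fileutil // hypothetical package, shown only as a sketch

import (
    "os"
    "path/filepath"
)

// replaceFile writes data to a temporary file next to path and renames it
// into place, so readers only ever see the old or the new contents, never
// a mix of the two.
func replaceFile(path string, data []byte, perm os.FileMode) error {
    tmp, err := os.CreateTemp(filepath.Dir(path), ".marker-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // best-effort cleanup; a no-op after a successful rename

    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Sync(); err != nil { // flush to disk before the rename
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    if err := os.Chmod(tmp.Name(), perm); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path)
}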

@ycombinator
Contributor

Yup, lack of truncation might be the issue. We are not opening the file with os.O_TRUNC:

f, err := os.OpenFile(markerFile, os.O_WRONLY|os.O_CREATE, 0600)

Still, I'd like to know exactly which commit / PR introduced this bug, to try and understand why it's happening. Going to continue bisecting for now but will try the truncation fix at some point for sure.
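For comparison, the truncating variant of that call (which is essentially what the fix amounts to, per the confirmation below) would be:

// os.O_TRUNC discards any existing contents on open, so a shorter marker
// can no longer leave stale bytes behind from an earlier, longer write.
f, err := os.OpenFile(markerFile, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0600)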

@ycombinator
Contributor

Bug was introduced in ad7e1b5 / #3827 (shocker! 🙄).

Trying the truncation fix now before diving into exactly where in the above change the bug was introduced.

@ycombinator
Contributor

Yup, adding the os.O_TRUNC flag when opening the file solved the issue 🙂. Thanks @leehinman and @cmacknz for the assist!

$ while true; do date -u; cat .update-marker; echo '---'; sleep 0.25; done
Thu Dec 21 21:48:27 UTC 2023
cat: .update-marker: No such file or directory
---
Thu Dec 21 21:48:27 UTC 2023
cat: .update-marker: No such file or directory
---
Thu Dec 21 21:48:27 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0-SNAPSHOT
  state: UPG_REPLACING
  metadata:
    retry_until: null
---
Thu Dec 21 21:48:28 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0-SNAPSHOT
  state: UPG_REPLACING
  metadata:
    retry_until: null
---
Thu Dec 21 21:48:28 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  metadata:
    retry_until: null
---
Thu Dec 21 21:48:28 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  metadata:
    retry_until: null
---

Now going to try running the TestFleetManagedUpgrade E2E test and see if it passes.

@ycombinator
Contributor

Now going to try running the TestFleetManagedUpgrade E2E test and see if it passes.

Well, this test didn't pass, but that's to be expected because the non-truncating write in question happens in the Upgrade Watcher, which runs as part of the new (i.e. upgraded) Agent, and the Agent being upgraded to in the test is pulled down from the artifacts API, not the one that's locally built.

The PR for the fix is up: #3948. I've added a unit test to catch the truncation bug but, as explained above, the PR is not expected to pass the integration tests.
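A regression test for this class of bug boils down to writing a longer marker followed by a shorter one and asserting that the file contains exactly the shorter contents. A minimal sketch, assuming it lives next to the writeMarkerFile helper referenced in SaveMarker above and uses testify's require package (imports for testing, os, and path/filepath assumed; not necessarily the test added in #3948):

func TestWriteMarkerFileTruncates(t *testing.T) {
    markerFile := filepath.Join(t.TempDir(), ".update-marker")

    longMarker := []byte("hash: 7e8029\ndetails:\n  metadata:\n    retry_until: null\n")
    shortMarker := []byte("hash: 7e8029\ndetails: null\n")

    // Write the longer contents first, then overwrite with shorter contents.
    require.NoError(t, writeMarkerFile(markerFile, longMarker, true))
    require.NoError(t, writeMarkerFile(markerFile, shortMarker, true))

    // Any trailing bytes left over from the first write would corrupt the YAML.
    got, err := os.ReadFile(markerFile)
    require.NoError(t, err)
    require.Equal(t, shortMarker, got)
}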
