
[8.12] Upgrades being incorrectly rolled back because the agent cannot parse the upgrade marker #3947

Closed
cmacknz opened this issue Dec 21, 2023 · 15 comments · Fixed by #3948
Labels: bug (Something isn't working) · Team:Elastic-Agent (Label for the Agent team) · v8.12.0

Comments

@cmacknz
Member

cmacknz commented Dec 21, 2023

Primarily observed on Linux. This is the error we are seeing:

Error: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context

How to reproduce

Run the Fleet managed upgrade test:

AGENT_KEEP_INSTALLED=true SNAPSHOT=true TEST_PLATFORMS="linux/amd64" mage -v integration:single TestFleetManagedUpgrade

It consistently fails for me.

Some extra information

The process that fails is the new Elastic Agent; looking at the systemd logs we can see:

Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Error: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Error: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Usage:
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   elastic-agent [subcommand] [flags]
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   elastic-agent [command]
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Available Commands:
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   completion  Generate the autocompletion script for the specified shell
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   component   Tools to work on components
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   diagnostics Gather diagnostics information from the Elastic Agent and write it to a zip archive
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   enroll      Enroll the Elastic Agent into Fleet
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   help        Help about any command
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   inspect     Show current configuration of the Elastic Agent
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   install     Install Elastic Agent permanently on this system
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   logs        Output Elastic Agent logs
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   restart     Restart the currently running Elastic Agent daemon
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   run         Start the Elastic Agent
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   status      Show the current status of the running Elastic Agent daemon
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   uninstall   Uninstall Elastic Agent from this system
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   upgrade     Upgrade the currently installed Elastic Agent to the specified version
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   version     Display the version of the elastic-agent.
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   watch       Watch the Elastic Agent for failures and initiate rollback
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Flags:
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -c, --c string                     Configuration file, relative to path.config (default "elastic-agent.yml")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --config string                Configuration file, relative to path.config (default "elastic-agent.yml")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -d, --d string                     Enable certain debug selectors
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -e, --e                            Log to stderr and disable syslog/file output
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --environment environmentVar   set environment being ran in (default default)
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -h, --help                         help for elastic-agent
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.config string           Config path is the directory Agent looks for its config file (default "/opt/Elastic/Agent")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.downloads string        Downloads path contains binaries Agent downloads (default "/opt/Elastic/Agent/data/elastic-agent-53dce0/downloads")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.home string             Agent root path (default "/opt/Elastic/Agent")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.install string          DEPRECATED, setting this flag has no effect since v8.6.0
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:       --path.logs string             Logs path contains Agent log output (default "/opt/Elastic/Agent")
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]:   -v, --v                            Log at INFO level
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: Use "elastic-agent [command] --help" for more information about a command.
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e elastic-agent[65235]: error checking for and handling upgrade: unable to load upgrade marker to check if Agent is being upgraded: yaml: line 17: mapping values are not allowed in this context
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Failed with result 'exit-code'.
Dec 21 15:13:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Unit process 65049 (elastic-agent) remains running after unit stopped.
Dec 21 15:15:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: elastic-agent.service: Scheduled restart job, restart counter is at 2.
Dec 21 15:15:49 ogc-linux-amd64-ubuntu-2204-fleet-331e systemd[1]: Stopped Elastic Agent is a unified agent to observe, monitor and protect your system..

There are two odd things there:

  1. The error shown at the top of this issue
  2. The help output, as if the Elastic Agent had been run with a wrong CLI argument.

Looking at the upgrade marker file before the watcher rolls back the upgrade, we can see it is corrupted:

hash: 7e8029
updated_on: 2023-12-21T14:14:20.80794607Z
prev_version: 8.13.0
prev_hash: 53dce0
acked: false
action:
  id: fd49bdb0-d30c-48ff-b411-5d98ea3525d7
  type: UPGRADE
  version: 8.13.0-SNAPSHOT
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  action_id: fd49bdb0-d30c-48ff-b411-5d98ea3525d7
  metadata:
    retry_until: null
_percent: 1
    download_rate: .inf
    retry_until: null

In all my tests I got exactly the same output and the same corruption.

The function that likely writes this version of the upgrade marker is

func SaveMarker(marker *UpdateMarker, shouldFsync bool) error {
    makerSerializer := &updateMarkerSerializer{
        Hash:        marker.Hash,
        UpdatedOn:   marker.UpdatedOn,
        PrevVersion: marker.PrevVersion,
        PrevHash:    marker.PrevHash,
        Acked:       marker.Acked,
        Action:      convertToMarkerAction(marker.Action),
        Details:     marker.Details,
    }
    markerBytes, err := yaml.Marshal(makerSerializer)
    if err != nil {
        return err
    }
    return writeMarkerFile(markerFilePath(), markerBytes, shouldFsync)
}

That is called by

upgradeDetails := initUpgradeDetails(marker, upgrade.SaveMarker, log)

The process that runs those functions is the upgrade watcher, which runs as part of the new version of the Elastic Agent, i.e. the version being upgraded to.

@cmacknz cmacknz added bug Something isn't working Team:Elastic-Agent Label for the Agent team v8.12.0 labels Dec 21, 2023
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@ycombinator ycombinator changed the title [8.12] Upgrades being unconditionally rolledback because the agent cannot parse the upgrade market [8.12] Upgrades being unconditionally rolledback because the agent cannot parse the upgrade marker Dec 21, 2023
@cmacknz cmacknz changed the title [8.12] Upgrades being unconditionally rolledback because the agent cannot parse the upgrade marker [8.12] Upgrades being incorrectly rolled back because the agent cannot parse the upgrade marker Dec 21, 2023
@cmacknz
Member Author

cmacknz commented Dec 21, 2023

I also observed something like this in the agent status output, which I suspect (but didn't confirm) is a different form of the same problem:

    upgrade_fleet_test.go:248: Waiting for upgrade watcher to start...
    upgrade_fleet_test.go:251: Upgrade watcher started
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrader.go:434: waiting for healthy agent and proper version: could not unmarshal agent status output: error: exit status 1, output: Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused"
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html
        
        invalid character 'E' looking for beginning of value
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrader.go:434: waiting for healthy agent and proper version: could not unmarshal agent status output: error: exit status 1, output: Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused"
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.12/fleet-troubleshooting.html

Observed in TestFleetAirGappedUpgrade in https://buildkite.com/elastic/elastic-agent/builds/5716

@ycombinator
Contributor

So far I have been unable to reproduce this locally. So I'm going to be adding logging, etc. to try and debug via CI builds, where this issue seems to reproduce pretty consistently. Here's my WIP PR for that: #3948.

While I'm waiting on CI to run on that PR, I looked through the failing tests with this symptom for the past few runs on main. The one pattern I'm seeing is that this issue seems to be showing up only in Fleet-managed upgrades; there are some non-Fleet-managed upgrade tests failing too but they are not showing the same symptom. This makes me suspect the race might involve the Ack'ing code, which is only exercised in Fleet-managed mode, and which does write to the Upgrade Marker file.

// Ack acks last upgrade action
func (u *Upgrader) Ack(ctx context.Context, acker acker.Acker) error {
    // get upgrade action
    marker, err := LoadMarker()
    if err != nil {
        return err
    }
    if marker == nil {
        return nil
    }
    if marker.Acked {
        return nil
    }

    // Action can be nil if the upgrade was called locally.
    // Should handle gracefully
    // https://github.com/elastic/elastic-agent/issues/1788
    if marker.Action != nil {
        if err := acker.Ack(ctx, marker.Action); err != nil {
            return err
        }
        if err := acker.Commit(ctx); err != nil {
            return err
        }
    }

    marker.Acked = true

    return SaveMarker(marker, false)
}

@cmacknz
Member Author

cmacknz commented Dec 21, 2023

If the agent re-execs fast enough, that write might happen soon after the watcher is launched, while the watcher is trying to read the file.

@ycombinator
Contributor

ycombinator commented Dec 21, 2023

If the agent re-execs fast enough, that write might happen soon after the watcher is launched, while the watcher is trying to read the file.

Yes, my working theory is that the Watcher starts up, the Agent re-execs at the same time, and both try to write to the Upgrade Marker file for different reasons: the Watcher to write UPG_WATCHING and the new Agent to acknowledge the upgrade (this happens very early on in the Coordinator's Run method).

I don't think the race condition is coming from the old Agent writing the upgrade marker file at the same time as the Watcher. That's because the last write the old Agent makes is before it even starts up the Watcher.

@ycombinator
Contributor

So the "good" news is that I'm able to reproduce this scenario locally, on my Mac, using two local builds and Fleet spun up with elastic-package. Hopefully now I can iterate pretty quickly on a potential fix.

@ycombinator
Contributor

ycombinator commented Dec 21, 2023

Okay, so I've been able to rule out the Fleet ack'ing theory I mentioned in #3947 (comment) and #3947 (comment). I've ruled it out in a couple of different ways. In both cases I was printing out the upgrade marker file every 250ms in a terminal window to see how its contents changed and when.

  1. I used builds of Agent where I commented out the code where the Ack function was writing the upgrade marker to disk. Despite doing this, I saw the same symptoms of corruption in the upgrade marker file right after the upgrade watcher started up.
  2. I upgraded a standalone agent using --source-uri=file://... and saw the same symptoms of corruption in the upgrade marker file right after the upgrade watcher started up.

Interestingly, there was a slight difference in the corruption I saw between the two scenarios.

In 1, the corrupted file looked like this:

hash: "793184"
updated_on: 2023-12-21T12:12:45.029749-08:00
prev_version: 8.11.0
prev_hash: f8671d
acked: false
action:
  id: e519c3ac-d47c-4f50-be7f-92853037db42
  type: UPGRADE
  version: 8.13.0-SNAPSHOT
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  action_id: e519c3ac-d47c-4f50-be7f-92853037db42
  metadata:
    retry_until: null
_percent: 1
    download_rate: .inf
    retry_until: null

In 2, the corrupted file looked like this:

Thu Dec 21 20:16:17 UTC 2023
hash: "793184"
updated_on: 2023-12-21T12:16:16.51586-08:00
prev_version: 8.11.0
prev_hash: f8671d
acked: false
action: null
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  metadata:
    retry_until: null
ntil: null

This is making me think something simpler is wrong, perhaps with the YAML serialization or with writing the serialized bytes to disk. What doesn't make sense (yet) is why either of these problems would affect the upgrade details fields but not the other fields in the upgrade marker. Investigating further...

@leehinman
Contributor

This is making me think something simpler is wrong, perhaps with the YAML serialization or with writing the serialized bytes to disk. What doesn't make sense (yet) is why either of these problems would affect the upgrade details fields but not the other fields in the upgrade marker. Investigating further...

It looks like, for the 2nd corrupted file, the first time the data was written to disk the metadata had another line that was 10 characters long, and then we did a write over the top of the whole file that was shorter in length. That would explain why the corruption only happens at the end of the file, and why "ntil: null" and " retry_until: null" overlap. But to do that we would either have to not update EOF, or write the contents of a buffer twice where we used the old length.
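To make that mechanism concrete, here is a minimal, self-contained sketch (hypothetical scratch file and contents, not the agent's actual marker code) showing how an os.O_WRONLY|os.O_CREATE open without os.O_TRUNC leaves the tail of a previous, longer write in place:

package main

import (
    "fmt"
    "os"
)

// writeNoTrunc mimics the marker write path: open for writing and create
// the file if missing, but do NOT truncate any existing contents.
func writeNoTrunc(path string, data []byte) error {
    f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE, 0600)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = f.Write(data) // writes start at offset 0, not at EOF
    return err
}

func main() {
    const path = "marker-demo.yml" // hypothetical scratch file

    long := []byte("details:\n  metadata:\n    retry_until: null\n")
    short := []byte("details:\n  metadata: {}\n")

    _ = writeNoTrunc(path, long)  // first, longer write (errors ignored for brevity)
    _ = writeNoTrunc(path, short) // second, shorter write only overwrites the prefix

    got, _ := os.ReadFile(path)
    fmt.Printf("%q\n", got) // the tail of the first write is still there after the shorter contents
}

The second write replaces only the first len(short) bytes; everything after that is left over from the first write, which matches the "ntil: null" / "retry_until: null" overlap seen above.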

@ycombinator
Contributor

That's a good observation, Lee! But how would that explain the first corrupted file, where the corruption is in the middle of the file?

@ycombinator
Contributor

ycombinator commented Dec 21, 2023

So, some good news. I found that this bug does not exist as of commits d80cadd, 8082519, 76ce7f1, and cc81e4b. Bisecting to figure out where it was introduced after that point.

@cmacknz
Member Author

cmacknz commented Dec 21, 2023

It looks like, for the 2nd corrupted file, the first time the data was written to disk the metadata had another line that was 10 characters long, and then we did a write over the top of the whole file that was shorter in length.

Do we need to manually truncate the file before writing it? Or just replace the whole file?

We don't just want to write, we want to completely replace the file contents.
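For reference, one way to get "completely replace the contents" semantics regardless of the new size is to write a sibling temp file and atomically rename it over the marker. This is only a sketch of that alternative (hypothetical replaceFile helper and package name), not what #3948 necessarily implements:

package fileutil // hypothetical package, shown only as a sketch

import (
    "os"
    "path/filepath"
)

// replaceFile writes data to a temporary file next to path and renames it
// into place, so readers only ever see the old or the new contents, never
// a mix of the two.
func replaceFile(path string, data []byte, perm os.FileMode) error {
    tmp, err := os.CreateTemp(filepath.Dir(path), ".marker-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // best-effort cleanup; a no-op after a successful rename

    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Sync(); err != nil { // flush to disk before the rename
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    if err := os.Chmod(tmp.Name(), perm); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path)
}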

@ycombinator
Contributor

Yup, lack of truncation might be the issue. We are not opening the file with os.O_TRUNC:

f, err := os.OpenFile(markerFile, os.O_WRONLY|os.O_CREATE, 0600)

Still, I'd like to know exactly which commit / PR introduced this bug, to try and understand why it's happening. Going to continue bisecting for now but will try the truncation fix at some point for sure.
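For comparison, the truncating variant of that call (which is essentially what the fix amounts to, per the confirmation below) would be:

// os.O_TRUNC discards any existing contents on open, so a shorter marker
// can no longer leave stale bytes behind from an earlier, longer write.
f, err := os.OpenFile(markerFile, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0600)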

@ycombinator
Contributor

Bug was introduced in ad7e1b5 / #3827 (shocker! 🙄).

Trying the truncation fix now before diving into exactly where in the above change the bug was introduced.

@ycombinator
Contributor

Yup, adding the os.O_TRUNC flag when opening the file solved the issue 🙂. Thanks @leehinman and @cmacknz for the assist!

$ while true; do date -u; cat .update-marker; echo '---'; sleep 0.25; done
Thu Dec 21 21:48:27 UTC 2023
cat: .update-marker: No such file or directory
---
Thu Dec 21 21:48:27 UTC 2023
cat: .update-marker: No such file or directory
---
Thu Dec 21 21:48:27 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0-SNAPSHOT
  state: UPG_REPLACING
  metadata:
    retry_until: null
---
Thu Dec 21 21:48:28 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0-SNAPSHOT
  state: UPG_REPLACING
  metadata:
    retry_until: null
---
Thu Dec 21 21:48:28 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  metadata:
    retry_until: null
---
Thu Dec 21 21:48:28 UTC 2023
hash: a50489
updated_on: 2023-12-21T13:48:27.676782-08:00
prev_version: 8.11.0
prev_hash: 878b50
acked: false
action: null
details:
  target_version: 8.13.0
  state: UPG_WATCHING
  metadata:
    retry_until: null
---

Now going to try running the TestFleetManagedUpgrade E2E test and see if it passes.

@ycombinator
Contributor

Now going to try running the TestFleetManagedUpgrade E2E test and see if it passes.

Well, this test didn't pass, but that's to be expected because the non-truncating write in question happens in the Upgrade Watcher, which runs as part of the new (i.e. upgraded) Agent, and the Agent being upgraded to in the test is pulled down from the artifacts API, not the one that's locally built.

The PR for the fix is up: #3948. I've added a unit test to catch the truncation bug but, as explained above, the PR is not expected to pass the integration tests.
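A regression test for this class of bug boils down to writing a longer marker followed by a shorter one and asserting that the file contains exactly the shorter contents. A minimal sketch, assuming it lives next to the writeMarkerFile helper referenced in SaveMarker above and uses testify's require package (imports for testing, os, and path/filepath assumed; not necessarily the test added in #3948):

func TestWriteMarkerFileTruncates(t *testing.T) {
    markerFile := filepath.Join(t.TempDir(), ".update-marker")

    longMarker := []byte("hash: 7e8029\ndetails:\n  metadata:\n    retry_until: null\n")
    shortMarker := []byte("hash: 7e8029\ndetails: null\n")

    // Write the longer contents first, then overwrite with shorter contents.
    require.NoError(t, writeMarkerFile(markerFile, longMarker, true))
    require.NoError(t, writeMarkerFile(markerFile, shortMarker, true))

    // Any trailing bytes left over from the first write would corrupt the YAML.
    got, err := os.ReadFile(markerFile)
    require.NoError(t, err)
    require.Equal(t, shortMarker, got)
}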
