Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the OVERRIDE_AGENT_PACKAGE_VERSION value as 8.13 is now built #3901

Merged
merged 20 commits into from
Jan 9, 2024

Conversation

pierrehilbert
Copy link
Contributor

What does this PR do?

Why is it important?

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@pierrehilbert pierrehilbert changed the title Remove OVERRIDE_AGENT_PACKAGE_VERSION Remove the OVERRIDE_AGENT_PACKAGE_VERSION value as 8.13 is now built Dec 12, 2023
@cmacknz
Copy link
Member

cmacknz commented Jan 3, 2024

Failing with:


libcloud.common.google.GoogleBaseError: "The zone 'projects/elastic-platform-ingest/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'."
--
  | 2024-01-03 12:05:05 EST | 2024-01-03T17:04:46Z <Greenlet at 0x7f69ae4c96c0: _up_async(<LayoutModel: 14>)> failed with GoogleBaseError

@cmacknz
Copy link
Member

cmacknz commented Jan 5, 2024

The tests failures here are definitely suspicious:

upgrader.go:231: Upgrading from version "7.17.16" to version "8.13.0-SNAPSHOT"
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent upgrade 8.13.0-SNAPSHOT --source-uri file:///home/ubuntu/agent/.agent-testing/local --skip-verify]
    upgrader.go:275: waiting for upgrade watcher to start
    upgrader.go:280: upgrade watcher started
    upgrader.go:285: Checking upgrade details state while Upgrade Watcher is running
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent version --binary-only --yaml]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrade_standalone_test.go:51: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_standalone_test.go:51
        	Error:      	Received unexpected error:
        	            	failed waiting for status: context deadline exceeded
        	            	could not unmarshal agent status output: error: context deadline exceeded, output: 
        	            	unexpected end of JSON input
        	Test:       	TestStandaloneUpgrade/Upgrade_7.17.16_to_8.13.0-SNAPSHOT
    fixture_install.go:127: [test TestStandaloneUpgrade/Upgrade_7.17.16_to_8.13.0-SNAPSHOT] Inside fixture cleanup function
    fixture_install.go:142: collecting diagnostics; test failed
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestStandaloneUpgrade-Upgrade_7.17.16_to_8.13.0-SNAPSHOT-diagnostics-2024-01-05T09:35:37Z.zip]
    fixture.go:615: >> running binary with: [/opt/Elastic/Agent/elastic-agent uninstall --force]

@cmacknz
Copy link
Member

cmacknz commented Jan 5, 2024

The test failure here reproduces when I run this test locally, so there is something legitimately wrong here.

@ycombinator anything come to mind? This feels like something we already fixed. I'm also not sure why renaming the agent from 8.12.0-SNAPSHOT to 8.13.0-SNAPSHOT would change anything (since that is the only real change here from the agent perspective):

    upgrade_standalone_test.go:51: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_standalone_test.go:51
        	Error:      	Received unexpected error:
        	            	failed waiting for status: context deadline exceeded
        	            	could not unmarshal agent status output: error: context deadline exceeded, output: 
        	            	unexpected end of JSON input
        	Test:       	TestStandaloneUpgrade/Upgrade_7.17.16_to_8.13.0-SNAPSHOT

Unless /opt/Elastic/Agent/elastic-agent upgrade 8.13.0-SNAPSHOT --source-uri file:///home/ubuntu/agent/.agent-testing/local --skip-verify is actually downloading 8.13.0-SNAPSHOT and that is out of date. That shouldn't cause this failure though.

@ycombinator
Copy link
Contributor

ycombinator commented Jan 5, 2024

I'm also not sure why renaming the agent from 8.12.0-SNAPSHOT to 8.13.0-SNAPSHOT would change anything (since that is the only real change here from the agent perspective).

I think this can be explained by the fact that the version being upgraded to (endVersion in the test code) would be 8.13.0, which would then (rightfully) cause this flag is being set to false:

// For asserting on the effects of any Upgrade Watcher changes made in 8.13.0, we need
// the endVersion to be >= 8.13.0. Otherwise, these assertions will fail as those changes
// won't be present in the Upgrade Watcher. So we disable these assertions if the endVersion
// is < 8.13.0.
endVersion, err := version.ParseVersion(endVersionInfo.Binary.Version)
if err != nil {
return fmt.Errorf("failed to parse version of upgraded Agent binary: %w", err)
}
upgradeOpts.disableUpgradeWatcherUpgradeDetailsCheck = upgradeOpts.disableUpgradeWatcherUpgradeDetailsCheck ||
endVersion.Less(*version.NewParsedSemVer(8, 13, 0, "", ""))

And that, in turn, would cause a couple of assertions further in the test to (rightfully) be in play:

// Check that, while the Upgrade Watcher is running, the upgrade details in Agent status
// show the state as UPG_WATCHING.
if !upgradeOpts.disableUpgradeWatcherUpgradeDetailsCheck {
logger.Logf("Checking upgrade details state while Upgrade Watcher is running")
if err := waitUpgradeDetailsState(ctx, startFixture, details.StateWatching, 2*time.Minute, 10*time.Second, logger); err != nil {
// error context added by waitUpgradeDetailsState
return err
}
}

// Check that, upon successful upgrade, the upgrade details have been cleared out
// from Agent status.
if !upgradeOpts.disableUpgradeWatcherUpgradeDetailsCheck {
logger.Logf("Checking upgrade details state after successful upgrade")
if err := waitUpgradeDetailsState(ctx, startFixture, "", 2*time.Minute, 10*time.Second, logger); err != nil {
// error context added by checkUpgradeDetailsState
return err
}
}

I can see from the test's log output that the test is failing as part of the first assertion mentioned above. But I don't understand why, as a local build of 8.13.0 Agent should have the necessary changes for the assertion to pass.

As this failure can be reproduced locally, what is the output of elastic-agent status while the test is running, right after the upgrade watcher has started up? Also, for good measure, what are the contents of the .update-marker file at the same time?

@cmacknz
Copy link
Member

cmacknz commented Jan 5, 2024

There are no upgrade details here because 7.17 is too old to write them. Then again so is the 8.11.4 version used in the minor version - 2 test. The biggest difference between 7.17 and 8.11.4 is 7.17 is using the "old" version of the watcher.

This logic came in as part of #3827 which is in 8.12.

Edit: the change is in 8.12

@cmacknz
Copy link
Member

cmacknz commented Jan 5, 2024

The 8.12 backport of #3927 did not update the version gating the logic in the test so this problem has been there but hiding from us in all of our test runs.

@cmacknz
Copy link
Member

cmacknz commented Jan 5, 2024

This might just be because of the 7.17 watcher running, it is too old to write the watching state out so the test is expecting upgrade details but they don’t exist. 8.11.4 runs the 8.13 watcher which knows to write the upgrade details file. So this is looking like a test bug and not an agent bug.

I ran the 7.17.x to 8.13.x upgrade myself manually and it works fine.

@ycombinator
Copy link
Contributor

The 8.12 backport of #3927 did not update the version gating the logic in the test so this problem has been there but hiding from us in all of our test runs.

Yup, miss on my part. At this point, we should just update the minimum version in the guard to 8.12.0 both on main and in the 8.12 branch, since the necessary code is in both branches now. Here's the PR to main that I'll backport to 8.12 after it's merged: #4036

@ycombinator
Copy link
Contributor

This might just be because of the 7.17 watcher running, it is too old to write the watching state out so the test is expecting upgrade details but they don’t exist. 8.11.4 runs the 8.13 watcher which knows to write the upgrade details file. So this is looking like a test bug and not an agent bug.

I ran the 7.17.x to 8.13.x upgrade myself manually and it works fine.

But looking at the test failure, the error is:

could not unmarshal agent status output: error: context deadline exceeded, output: 
unexpected end of JSON input

That seems like elastic-agent status --output json is failing. So the test isn't even getting to the checks related to upgrade details being in the upgrade marker.

If the upgrade from 7.17.x to 8.13.x works manually, that does seem to suggest a test bug rather than an agent bug. But it would still be good to know why elastic-agent status --output json is failing. Investigating...

@ycombinator
Copy link
Contributor

I ran the 7.17.x to 8.13.x upgrade myself manually and it works fine.

How were you able to run this? I tried this with standalone Agent (with --source-uri file:///path/to/build/distributions --skip-verify) and also via Fleet running 8.13.0-SNAPSHOT in CFT. In either case I get an unknown hash error.

@cmacknz
Copy link
Member

cmacknz commented Jan 8, 2024

How were you able to run this? I tried this with standalone Agent (with --source-uri file:///path/to/build/distributions --skip-verify) and also via Fleet running 8.13.0-SNAPSHOT in CFT. In either case I get an unknown hash error.

I also had this problem, it was because I'm on an M1 Mac but we don't publish ARM packages for 7.17. Once I built the X86 darwin package locally the error went away. Likely the upgrade should have failed with a better error message.

@cmacknz
Copy link
Member

cmacknz commented Jan 8, 2024

#4038 has the fix needed to get this merged.

@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024

buildkite test it

Copy link

Quality Gate passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No Coverage information No data about Coverage
No Duplication information No data about Duplication

See analysis details on SonarQube

@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024

Test failure is unrelated, force merging.

@cmacknz cmacknz merged commit bf513a5 into main Jan 9, 2024
8 of 9 checks passed
@cmacknz cmacknz deleted the integration-test-to-8.13.0 branch January 9, 2024 19:32
@cmacknz
Copy link
Member

cmacknz commented Jan 9, 2024


Error: error running test: failed to prepare instance ogc-windows-amd64-2022-default-f008: failed to install chocolatey: could not run "powershell -NoProfile -InputFormat None -ExecutionPolicy Bypass -Command \"[System.Net.ServicePointManager]::SecurityProtocol = 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))\"" though SSH: Process exited with status 1 (stdout: , stderr: Exception calling "DownloadString" with "1" argument(s): "The remote name could not be resolved:
--
  | 2024-01-09 14:14:48 EST | 'community.chocolatey.org'"
  | 2024-01-09 14:14:48 EST | At line:1 char:60
  | 2024-01-09 14:14:48 EST | + ... col = 3072; iex ((New-Object System.Net.WebClient).DownloadString('ht ...
  | 2024-01-09 14:14:48 EST | +                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  | 2024-01-09 14:14:48 EST | + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
  | 2024-01-09 14:14:48 EST | + FullyQualifiedErrorId : WebException

We keep hitting DNS errors across a few different dependencies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants