Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: prevent false positives in already running k0s instance detection #5400

Closed

Conversation

peterhoneder
Copy link

Description

When running k0s in one of our deployments, we noticed that during startup with an existing runtime config file, it thinks it's running, but it isn't since the system just started up from scratch. The root cause is, that the detection in the linux runtime works by checking if the pid from the runtime config is a running process on the system, but not if the pid is actually the same executable image.

So what happens is, that if any other process during startup takes up that pid value from the runtime config, k0s thinks it is already running although it isn't.

I guess an even better method to solve this would be to hold a lock file instead of writing a pid value in the yaml, but the fix right now addresses the topic by checking if the pid value is from the same executable image, and if not, it will allow k0s to be started although the pid value is taken by another process.

Fixes #5399

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

The tests extend the current runtime tests by 3 more cases:

  • check if NewRuntimeConfig detects if a pid from the runtime config is running with the same executable image as its own
  • check if another process is running with the pid value from the runtime config, but is from a different executable image, that it will not result in ErrK0sAlreadyRunning
  • check if no process is running under the pid from the runtime config, it will also work correctly (no error)

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@peterhoneder peterhoneder requested review from a team as code owners January 3, 2025 11:46
@peterhoneder peterhoneder marked this pull request as draft January 3, 2025 12:35
@peterhoneder peterhoneder force-pushed the fix-runtime-pid-checks branch from e19014e to de62d7c Compare January 3, 2025 17:22
The main bugfix is to check if a possibly running pid from the runtime
config not only exists as a process, but also is the same excutable
image that is checking.

All platforms use a common library for ps-style checks now in this
implementation.

This changes the behavior for all platforms to use a common library
to perform the "is k0s running" checks on startup. The advantage is,
that the code is much more simple, the disadvantage that an additional
library is now required with additional sub-dependencies.

This commit also refactors the function checkPid into checkInstanceRunning,
since the previous implementation did not allow for more complex error
handling and would have hidden errors. More complex error handling is
now required because the implementation is more complex.

Signed-off-by: Peter Honeder <peter.honeder@unwired.at>
@peterhoneder peterhoneder force-pushed the fix-runtime-pid-checks branch from de62d7c to 0f4f23e Compare January 3, 2025 17:35
@peterhoneder peterhoneder marked this pull request as ready for review January 3, 2025 19:32
Copy link
Member

@twz123 twz123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! And double thanks for providing a PR with thorough comments.

When running k0s in one of our deployments, we noticed that during startup with an existing runtime config file, it thinks it's running, but it isn't since the system just started up from scratch.

This is interesting, as the runtime config path is usually stored below /run, which is usually a tmpfs and shouldn't survive reboots.

So what happens is, that if any other process during startup takes up that pid value from the runtime config, k0s thinks it is already running although it isn't.

The good ol' problem with PID file races. This has been addressed in in other parts of k0s already, but it's still a problem here.

I guess an even better method to solve this would be to hold a lock file instead of writing a pid value in the yaml, [...]

This is related to #1434, which is about replacing the runtime config file entirely by using the k0s status socket.

[...] but the fix right now addresses the topic by checking if the pid value is from the same executable image, and if not, it will allow k0s to be started although the pid value is taken by another process.

Agree. Left some comments in the review.

// (infinity is not available on default macOS zsh), this simulates
// a different running process with the same pid as stored in the runtime config, but
// from a different executable image
cmd := getBackgroundCommand()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these kinds of tests, you could use internal/testutil/pingpong, which provides a no-op testing process and might help avoid some boilerplate here.

Comment on lines +94 to +97
// - actually test if k0s is still running by checking against:
// - the same executable with the same pid as the runtime config
// - a different executable with the same pid as the runtime config
// - a pid that is not running anymore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. It would be super helpful to have all of those in separate tests.

@@ -1,7 +1,7 @@
//go:build unix

/*
Copyright 2023 k0s authors
Copyright 2025 k0s authors
Copy link
Member

@twz123 twz123 Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leave the copyright date as is.


// check that the executable is the same in order to issue an error, if it would be a different
// than it can't be a running instance of k0s, which the check is all about
if exeTarget != currentExe {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's os.SameFile which might help here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we take this approach, there is a chance that another running copy of the k0s binary won't be detected as already running. Say k0s was started by the init system and is running in the background. Then someone grabs a fresh binary and runs sudo ./k0s controller. This will certainly cause some problems. I am not sure how contrived this scenario is, though.


// Get the executable path of the target pid
// - no need to resolve symlinks here, it is already a resolved path
exeTarget, err := proc.Exe()
Copy link
Member

@twz123 twz123 Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather hardcode something like os.Stat(filepath.Join("/proc", pid, "exe")) here rather than pulling in a rather large dependency. This wouldn't work on Windows, but I'd revisit this another time. If runtime.GOOS is not "linux" , you could simply skip this function call, or return successful unconditionally, or return an error that will unwrap to errors.ErrUnsupported and check for this on the call sites ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe os.Args[0] would be a platform-agnostic alternative?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually my first PR commit had just a linux impl using the procfs. @twz123 the reason for this using procfs or something else to find the executable path, is actually that you need to find the executable path of a foreign pid too, not just your own.
I also think it would be much better to just implement something for Linux, or maybe even drop the PR completely, if you plan to change the runtime config parts anyway soon. Here was the original minimal change to checkPid that will only work on Linux:
e19014e#diff-406aeff8e9fcab83e1f62a534a0b5f9292b2cad0bf7d874a526fc15a9f6e8fe8L27

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to find the executable path of a foreign pid too, not just your own.

... of course 🤦‍♂️

or maybe even drop the PR completely, if you plan to change the runtime config parts anyway soon.

Not sure about the soon part. It's definitely something worth doing, but that issue is lying around for years already, and is not really on the roadmap. You could try to tackle it, If you're interested! OTOH, we could also try to get this PR merged in the meantime.

Copy link
Contributor

github-actions bot commented Jan 6, 2025

This pull request has merge conflicts that need to be resolved.

@ncopa
Copy link
Collaborator

ncopa commented Jan 14, 2025

I had a look at this and my general thinking is that this is a complicated and unreliable solution. I think we should simply drop the pid check completely and use other ways to prevent multiple instances of k0s, for example using flock(2).

@ncopa
Copy link
Collaborator

ncopa commented Jan 14, 2025

I have an alternative way to solve this by simply removing the pid from config and use flock instead to ensure unique instance of k0s: #5435

@peterhoneder
Copy link
Author

I have an alternative way to solve this by simply removing the pid from config and use flock instead to ensure unique instance of k0s: #5435

I like that much more and I think that's the better solution!

@peterhoneder
Copy link
Author

closing because no reason to continue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

k0s error "an instance of k0s is already running" but it is not running
3 participants