-
Notifications
You must be signed in to change notification settings - Fork 2.9k
rootless: use nsfs file handles to persist namespaces #27880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
88d18ef to
65d8b55
Compare
|
[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore. |
2 similar comments
|
[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore. |
|
[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore. |
d53f620 to
0dcca5e
Compare
|
@containers/podman-maintainers tests are passing, ready for review |
libpod/runtime_migrate_linux.go
Outdated
| } | ||
|
|
||
| nsHandlesPath := rootless.GetNamespaceHandlesPath(stateDir) | ||
| _ = os.Remove(nsHandlesPath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be logged in any way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a warning
|
|
||
| // GetNamespaceHandlesPath returns the path to the namespace handles file | ||
| // in the given state directory. | ||
| func GetNamespaceHandlesPath(stateDir string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be good to drop a couple of unit tests here to prevent regression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
LGTM, I had a couple of sideways questions that you can decide on ... we should get somebody with more C to review this too. |
pkg/rootless/rootless_linux.c
Outdated
| int p[2]; | ||
| char pause_pid_file_path[PATH_MAX]; | ||
|
|
||
| snprintf (pause_pid_file_path, PATH_MAX, "%s/pause.pid", state_dir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also check for PATH_MAX here?
if (ret >= PATH_MAX)
{
errno = ENAMETOOLONG;
return -1;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, added
|
What does this look like when upgrading from an older Podman without a restart? Will the user need a |
0dcca5e to
86f39f4
Compare
no, it will automatically join the pause process as we do now, then save the file with the handles. There is no manual intervention required. It won't kill the pause process though, so it can be a problem if someone mixes different versions and restart the pause process |
fdb0407 to
692410d
Compare
|
comments addressed |
a8a6764 to
5ec69ed
Compare
|
@giuseppe , Windows test fails with TestGetPausePidPath, this looks suspicious. |
use name_to_handle_at and open_by_handle_at to persist rootless namespaces without needing a pause process. The namespace file handles are stored in a file and can be used to rejoin the namespaces, as long as the namespaces still exist. Fall back to the pause process approach only when the kernel doesn't support nsfs handles (EOPNOTSUPP). These changes in the kernel are required (landed in Linux 6.18): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ab378cfa793 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
I will test locally on WSL/Hyper-V this afternoon, to provide more details. |
5ec69ed to
c33eaf9
Compare
|
I've just pushed a new version. Let's see how it goes |
Luap99
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't really looked at the code yet but ...
Is that not a major possible breaking change if we recreate a new ns all the time?
Something trivial as podman unshare podman mount X && podman unshare ls /mountpath will no longer work. The commands now must be in the same unshare command which is not a given. It also messes up the mount count state which can uncover problems like containers/buildah#6480
Also what is the cost of having to rexec all the time, do we notice a performance penalty? The other thing is that we expose the recreate the the ns outside of the libpod alive lock code path much more now as basically any command ends up in there when there are no running containers.
I think not having pause process is certainly nice but I do wonder if we should really switch to this by default as it could impact quite a lot of things. For a practical thing I don't like that the behavior would be vastly different between kernel versions.
If a user writes a script with two podman unshare commands that depend on the namespace state they might think it works fine and then get all of the sudden broken by a kernel update as we changed the behavior significantly.
we do not create it all the time, we create it only when there are no other containers running. As long as there is at least one container or podman command keeping alive the namespaces, we reuse them. Creating a user namespace all the time, would be an issue as we can't share resources among containers. Something trivial as yes, that won't work. The mount itself won't keep the namespaces alive. We already have a different behavior with buildah vs podman, so this looks like an occasion to unify the two tools. It will be easier to move the current behavior to the container-libs instead of relying on the pause process. Having the pause process is not only a nuisance, it also leaks resources no matter if containers are running or not. As long as you run some Podman commands, now you'll leak forever a process. After this change there is no leak when there are no containers running. |
I got that, what I am saying is if there is no container running we rexec every single time which comes at a noticeable cost this PR: main: That suggest a simple command that doesn't have to do much is now twice as slow which does seem like an issue to me.
Sure, I am not saying the process leak is nice. I like to get rid of it. But I do wonder if the consequences of this new logic are not worse then the one process? At the very least I very much dislike that user will observe vastly different behaviours based on the kernel version. |
a quick comparison to other PRs doesn't seem to suggest a visible effect in the CI. What do you suggest to move forward? Make it configurable? |
use name_to_handle_at and open_by_handle_at to persist rootless namespaces without needing a pause process.
The namespace file handles are stored in a file and can be used to rejoin the namespaces, as long as the namespaces still exist.
Fall back to the pause process approach only when the kernel doesn't support nsfs handles (EOPNOTSUPP).
These changes in the kernel are required (landed in Linux 6.18):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ab378cfa793
Checklist
Ensure you have completed the following checklist for your pull request to be reviewed:
commits. (
git commit -s). (If needed, usegit commit -s --amend). The author email must matchthe sign-off email address. See CONTRIBUTING.md
for more information.
Fixes: #00000in commit message (if applicable)make validatepr(format/lint checks)Noneif no user-facing changes)Does this PR introduce a user-facing change?