@giuseppe
Member

@giuseppe giuseppe commented Jan 8, 2026

use name_to_handle_at and open_by_handle_at to persist rootless namespaces without needing a pause process.

The namespace file handles are stored in a file and can be used to rejoin the namespaces, as long as the namespaces still exist.

Fall back to the pause process approach only when the kernel doesn't support nsfs handles (EOPNOTSUPP).

These changes in the kernel are required (landed in Linux 6.18):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ab378cfa793
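
As a rough illustration of the approach described above (not the code from this PR), the following C sketch encodes a file handle for a namespace with name_to_handle_at and reports EOPNOTSUPP separately, so that a caller can fall back to the pause process. The helper name encode_ns_handle, the result enum, and the use of a path such as /proc/self/ns/user are assumptions made for the example. Storing the handle in a file and rejoining with open_by_handle_at and setns are not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch; not taken from the patch.  */
enum ns_handle_result
{
  NS_HANDLE_OK = 0,          /* handle encoded; it could be saved to a file */
  NS_HANDLE_UNSUPPORTED = 1, /* EOPNOTSUPP: fall back to the pause process */
  NS_HANDLE_ERROR = -1       /* any other failure; errno is set */
};

/* Encode a file handle for the namespace referenced by NS_PATH
   (for example /proc/self/ns/user).  On success *OUT points to a
   malloc'ed handle that the caller must free; otherwise *OUT is NULL.  */
static enum ns_handle_result
encode_ns_handle (const char *ns_path, struct file_handle **out, int *mount_id)
{
  struct file_handle *fh;
  int fd, ret, saved_errno;

  assert (ns_path != NULL && out != NULL && mount_id != NULL);
  *out = NULL;

  /* Opening the /proc magic link yields a descriptor on the nsfs file.  */
  fd = open (ns_path, O_RDONLY | O_CLOEXEC);
  if (fd < 0)
    return NS_HANDLE_ERROR;

  fh = malloc (sizeof (*fh) + MAX_HANDLE_SZ);
  if (fh == NULL)
    {
      close (fd);
      errno = ENOMEM;
      return NS_HANDLE_ERROR;
    }
  fh->handle_bytes = MAX_HANDLE_SZ;

  /* AT_EMPTY_PATH makes the call operate on FD itself.  */
  ret = name_to_handle_at (fd, "", fh, mount_id, AT_EMPTY_PATH);
  saved_errno = errno;
  close (fd);

  if (ret < 0)
    {
      free (fh);
      errno = saved_errno;
      if (saved_errno == EOPNOTSUPP)
        return NS_HANDLE_UNSUPPORTED;
      return NS_HANDLE_ERROR;
    }

  *out = fh;
  return NS_HANDLE_OK;
}
```

A caller would try encode_ns_handle first and only start a pause process when it returns NS_HANDLE_UNSUPPORTED; the result depends on the running kernel, so the same binary can take either path.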

Checklist

Ensure you have completed the following checklist for your pull request to be reviewed:

  • Certify you wrote the patch or otherwise have the right to pass it on as an open-source patch by signing all
    commits. (git commit -s). (If needed, use git commit -s --amend). The author email must match
    the sign-off email address. See CONTRIBUTING.md
    for more information.
  • Referenced issues using Fixes: #00000 in commit message (if applicable)
  • Tests have been added/updated (or no tests are needed)
  • Documentation has been updated (or no documentation changes are needed)
  • All commits pass make validatepr (format/lint checks)
  • Release note entered in the section below (or None if no user-facing changes)

Does this PR introduce a user-facing change?

On Linux 6.18 and later, rootless Podman no longer creates a "pause" process to keep the user and mount namespaces alive.

@giuseppe giuseppe force-pushed the drop-pause-process branch 10 times, most recently from 88d18ef to 65d8b55 on January 9, 2026 17:10
@packit-as-a-service

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

@giuseppe giuseppe force-pushed the drop-pause-process branch 2 times, most recently from d53f620 to 0dcca5e on January 9, 2026 23:36
@giuseppe giuseppe changed the title from "[WIP] rootless: use nsfs file handles to persist namespaces" to "rootless: use nsfs file handles to persist namespaces" on Jan 10, 2026
@giuseppe giuseppe marked this pull request as ready for review on January 10, 2026 22:02
@giuseppe
Member Author

@containers/podman-maintainers tests are passing, ready for review

}

nsHandlesPath := rootless.GetNamespaceHandlesPath(stateDir)
_ = os.Remove(nsHandlesPath)
Member

should this be logged in any way?

Member Author

added a warning


// GetNamespaceHandlesPath returns the path to the namespace handles file
// in the given state directory.
func GetNamespaceHandlesPath(stateDir string) string {
Member

would it be good to drop a couple of unit tests here to prevent regression?

Member Author

added

@baude
Member

baude commented Jan 12, 2026

LGTM. I had a couple of side questions that you can decide on. We should also get somebody with more C experience to review this.

int p[2];
char pause_pid_file_path[PATH_MAX];

snprintf (pause_pid_file_path, PATH_MAX, "%s/pause.pid", state_dir);
Member

Would it make sense to also check for PATH_MAX here?

  if (ret >= PATH_MAX)
    {
      errno = ENAMETOOLONG;
      return -1;
    }

Member Author

sure, added
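
For context, the check suggested in this thread relies on the return value of snprintf, which is the length the full string would have had. A minimal sketch of the complete pattern follows; the helper name build_pause_pid_path and its signature are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "<state_dir>/pause.pid" into BUF.
   Returns 0 on success, or -1 if formatting failed or the result
   did not fit (errno is set to ENAMETOOLONG in the latter case).  */
static int
build_pause_pid_path (char *buf, size_t size, const char *state_dir)
{
  int ret;

  assert (buf != NULL && state_dir != NULL);

  ret = snprintf (buf, size, "%s/pause.pid", state_dir);
  if (ret < 0)
    return -1;

  /* snprintf returns the untruncated length, excluding the NUL byte,
     so a value >= SIZE means the output was cut short.  */
  if ((size_t) ret >= size)
    {
      errno = ENAMETOOLONG;
      return -1;
    }
  return 0;
}
```

With a buffer of PATH_MAX bytes this matches the suggested `ret >= PATH_MAX` test, and it avoids silently using a truncated path when state_dir is very long.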

@mheon
Member

mheon commented Jan 12, 2026

What does this look like when upgrading from an older Podman without a restart? Will the user need a system migrate to kill the existing pause process and make sure all rootless containers are using the new, correct rootless userns?

@giuseppe
Member Author

giuseppe commented Jan 12, 2026

> What does this look like when upgrading from an older Podman without a restart? Will the user need a system migrate to kill the existing pause process and make sure all rootless containers are using the new, correct rootless userns?

No, it will automatically join the existing pause process as we do now, then save the file with the handles. No manual intervention is required. It won't kill the pause process though, so it can be a problem if someone mixes different versions and restarts the pause process.

@giuseppe giuseppe force-pushed the drop-pause-process branch 3 times, most recently from fdb0407 to 692410d on January 14, 2026 11:21
@giuseppe
Member Author

comments addressed

@giuseppe giuseppe force-pushed the drop-pause-process branch 2 times, most recently from a8a6764 to 5ec69ed on January 15, 2026 12:08
@jankaluza
Member

@giuseppe, the Windows test fails with TestGetPausePidPath, this looks suspicious.

use name_to_handle_at and open_by_handle_at to persist rootless
namespaces without needing a pause process.

The namespace file handles are stored in a file and can be used to
rejoin the namespaces, as long as the namespaces still exist.

Fall back to the pause process approach only when the kernel doesn't
support nsfs handles (EOPNOTSUPP).

These changes in the kernel are required (landed in Linux 6.18):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3ab378cfa793

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@l0rd
Member

l0rd commented Jan 19, 2026

> @giuseppe, the Windows test fails with TestGetPausePidPath, this looks suspicious.

I will test locally on WSL/Hyper-V this afternoon, to provide more details.

@giuseppe
Member Author

I've just pushed a new version. Let's see how it goes

Member

@Luap99 Luap99 left a comment

I haven't really looked at the code yet but ...

Is that not a potentially major breaking change if we recreate a new ns all the time?
Something as trivial as podman unshare podman mount X && podman unshare ls /mountpath will no longer work. The commands must now be in the same unshare command, which is not a given. It also messes up the mount count state, which can uncover problems like containers/buildah#6480

Also, what is the cost of having to re-exec all the time? Do we notice a performance penalty? The other thing is that we now expose the recreation of the ns outside of the libpod alive lock code path much more, as basically any command ends up in there when there are no running containers.

I think not having a pause process is certainly nice, but I do wonder if we should really switch to this by default, as it could impact quite a lot of things. As a practical matter, I don't like that the behavior would be vastly different between kernel versions.
If a user writes a script with two podman unshare commands that depend on the namespace state, they might think it works fine and then all of a sudden get broken by a kernel update, because we changed the behavior significantly.

@giuseppe
Member Author

> Is that not a potentially major breaking change if we recreate a new ns all the time?

We do not create it all the time; we create it only when there are no other containers running. As long as at least one container or podman command keeps the namespaces alive, we reuse them. Creating a user namespace every time would be an issue, as we couldn't share resources among containers.

> Something as trivial as podman unshare podman mount X && podman unshare ls /mountpath will no longer work. The commands must now be in the same unshare command, which is not a given. It also messes up the mount count state, which can uncover problems like containers/buildah#6480

Yes, that won't work. The mount itself won't keep the namespaces alive. We already have different behavior in buildah vs podman, so this looks like an occasion to unify the two tools. It will be easier to move the current behavior to container-libs instead of relying on the pause process. Having the pause process is not only a nuisance, it also leaks resources whether or not containers are running. Today, once you run any Podman command, you leak a process forever. After this change there is no leak when there are no containers running.

@Luap99
Member

Luap99 commented Jan 19, 2026

> > Is that not a potentially major breaking change if we recreate a new ns all the time?

> We do not create it all the time; we create it only when there are no other containers running. As long as at least one container or podman command keeps the namespaces alive, we reuse them. Creating a user namespace every time would be an issue, as we couldn't share resources among containers.

I got that. What I am saying is that if there is no container running, we re-exec every single time, which comes at a noticeable cost.

this PR:

$ hyperfine "bin/podman ps"
Benchmark 1: bin/podman ps
  Time (mean ± σ):      71.8 ms ±   5.0 ms    [User: 53.1 ms, System: 33.8 ms]
  Range (min … max):    65.6 ms …  94.1 ms    31 runs

main:

$ hyperfine "bin/podman ps"
Benchmark 1: bin/podman ps
  Time (mean ± σ):      34.0 ms ±   2.9 ms    [User: 25.4 ms, System: 14.4 ms]
  Range (min … max):    26.0 ms …  44.4 ms    83 runs

That suggests a simple command that doesn't have to do much is now roughly twice as slow, which does seem like an issue to me.
I would expect that to add up quickly, even in our CI systems.
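
Taking the two hyperfine means above at face value (a single measurement on one machine, so indicative only), the arithmetic works out as:

```latex
\frac{71.8\ \text{ms}}{34.0\ \text{ms}} \approx 2.11,
\qquad
71.8\ \text{ms} - 34.0\ \text{ms} = 37.8\ \text{ms}
```

That is about 2.1 times the wall-clock time, or roughly 38 ms extra per invocation of podman ps in this benchmark.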

> > Something as trivial as podman unshare podman mount X && podman unshare ls /mountpath will no longer work. The commands must now be in the same unshare command, which is not a given. It also messes up the mount count state, which can uncover problems like containers/buildah#6480

> Yes, that won't work. The mount itself won't keep the namespaces alive. We already have different behavior in buildah vs podman, so this looks like an occasion to unify the two tools. It will be easier to move the current behavior to container-libs instead of relying on the pause process. Having the pause process is not only a nuisance, it also leaks resources whether or not containers are running. Today, once you run any Podman command, you leak a process forever. After this change there is no leak when there are no containers running.

Sure, I am not saying the process leak is nice. I would like to get rid of it. But I do wonder whether the consequences of this new logic are worse than the one leaked process. At the very least, I very much dislike that users will observe vastly different behaviours based on the kernel version.

@giuseppe
Member Author

> I would expect that to add up quickly, even in our CI systems.

A quick comparison with other PRs doesn't suggest a visible effect in the CI.

What do you suggest to move forward? Make it configurable?
