Sir, Stop Blaming Setuid for Your Troubles!
So we’re trying to support Podman on the HPC systems, and one of the things it needs is entries for each user in /etc/subuid and /etc/subgid. Filling those files out is not exactly easy when you have thousands of nodes and you need to update those files on each node every time a new user joins. An idea one of the folks here had was to instead put those files on the shared NFS that is mounted on all the nodes, and then simply create symlinks to those files. Now you only have one set of files to manage that all the nodes can see. Easy peasy!
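Each line in those files hands a user a range of subordinate IDs, and the symlinked setup looked roughly like this (the shared path is a stand-in for wherever the real copies live, and the usernames are made up):
$ cat /nfs/shared/etc/subuid
alice:100000000:65536
bob:100065536:65536
$ ln -s /nfs/shared/etc/subuid /etc/subuid
$ ln -s /nfs/shared/etc/subgid /etc/subgid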
Except, well,
$ podman image ls
ERRO[0000] running `/usr/bin/newuidmap 546533 0 12345 1 1 100000000 65536`:
Error: cannot set up namespace using "/usr/bin/newuidmap": exit status 1
And also for Apptainer (another container runtime we support)
$ apptainer build lolcow.sif lolcow.def
ERROR : 'newgidmap' execution failed. Check that 'newgidmap' is setuid root or has setcap cap_setgid+eip.
ERROR : Error while waiting event for user namespace mappings: no event received
Seeing that message, you would think “clearly the error message is spelling it out for you, you dolt!” That’s what I thought too, including the ‘you dolt!’ part. So that’s the direction I went in trying to debug this (turning on verbose debug output didn’t net me anything new, so this was all I had to go on). I thought it might be a setuid or setcap issue, as the error message told me. The wrench in this thought process, though, is that this was happening on one cluster while, on another cluster with seemingly the same setup, Podman and Apptainer worked fine without any of these issues. From what I could tell, newgidmap and newuidmap had the same setuid bits and capabilities on both those clusters. So why was it failing on one but not on the other?
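For the record, this is roughly the comparison I was running on both clusters; a lowercase s in the owner execute bit from ls means setuid, and getcap reports any file capabilities:
$ ls -l /usr/bin/newuidmap /usr/bin/newgidmap
$ getcap /usr/bin/newuidmap /usr/bin/newgidmap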
We need to go deeper
God I love strace. The first time I used it, I found it overwhelming and a little scary. That was just the imposter syndrome telling me I’m no good at this computer business, when really I just didn’t have an understanding of how Linux worked, and no one to teach me what strace is doing when it’s pumping out all that output. Having human mentors show you the ropes goes a long way toward proving that you can do this too and that you belong here. The manual is there for additional information once you have your feet under you, and everyone needs help and support.
Kamina: Believe in the me who believes in you
Anyway, strace. Apptainer was the runtime I was most familiar with, so that’s the one I inspected. Since it looked like both Apptainer and Podman were seeing the same issue with new*idmap, fixing things for one would likely fix it for the other.
$ strace -ff apptainer build lolcow.sif lolcow.def
<bunch of irrelevant output>
[pid 11668] openat(AT_FDCWD, "/etc/subgid", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_NOFOLLOW) = -1 ELOOP (Too many levels of symbolic links)
[pid 11668] exit_group(1) = ?
[pid 11667] <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 11668
[pid 11667] rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe617544980}, {sa_handler=0x55b3e6397dc0, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fe617544980}, 8) = 0
[pid 11667] ioctl(2, TIOCGWINSZ, 0x7ffdd1270510) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 11667] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
[pid 11667] rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
[pid 11667] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
[pid 11667] exit_group(1) = ?
[pid 11667] +++ exited with 1 +++
[pid 11639] <... wait4 resumed>[{WIFEXITED(s) && WEXITSTATUS(s) == 1}], 0, NULL) = 11667
[pid 11639] rt_sigaction(SIGINT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fba06242980}, NULL, 8) = 0
[pid 11639] rt_sigaction(SIGQUIT, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7fba06242980}, NULL, 8) = 0
[pid 11639] rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
[pid 11639] write(2, "\33[91mERROR : 'newgidmap' execut"..., 116ERROR : 'newgidmap' execution failed. Check that 'newgidmap' is setuid root or has setcap cap_setgid+eip.
) = 116
Ha! The culprit is the first line: [pid 11668] openat(AT_FDCWD, "/etc/subgid", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_NOFOLLOW) = -1 ELOOP (Too many levels of symbolic links). PID 11668 is the PID for the process running newgidmap within Apptainer. newgidmap is trying to open /etc/subgid, but is failing because of… too many levels of symbolic links? Wait, what?
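Side note: you don’t have to wade through that whole firehose to spot a line like this. Narrowing the trace to just the file-opening syscalls would have jumped straight to it:
$ strace -ff -e trace=openat apptainer build lolcow.sif lolcow.def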
Remember when I said earlier
An idea one of the folks here had was to instead put those files in the shared NFS that is mounted on all the nodes, and then simply create a symlink to those files.
“those files” being the subuid and subgid files, with symlinks in /etc/subuid and /etc/subgid pointing to the actual files. newuidmap and newgidmap don’t like it when /etc/sub*id is actually a symlink.
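You can spot the difference at a glance with readlink; on the cluster with the symlinked setup it prints the targets (the shared path here is a stand-in for ours), while on a cluster with regular files it prints nothing:
$ readlink /etc/subuid /etc/subgid
/nfs/shared/etc/subuid
/nfs/shared/etc/subgid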
If you look at the code in shadow-utils (which newuidmap and newgidmap are a part of; shadow-utils v4.8.1 is what our cluster has [1]), line 635 within the commonio_open function (which is called from the sub_gid_open function in subordinateio.c, which is in turn called from newgidmap.c) is what tries to open the file. The open call uses O_NOFOLLOW, which means it errors out if it encounters a symlink instead of following it to the linked file. From man 2 open:

O_NOFOLLOW
If the trailing component (i.e., basename) of pathname is a symbolic link, then the open fails, with the error ELOOP. Symbolic links in earlier components of the pathname will
still be followed. (Note that the ELOOP error that can occur in this case is indistinguishable from the case where an open fails because there are too many symbolic links found
while resolving components in the prefix part of the pathname.)
This flag is a FreeBSD extension, which was added in Linux 2.1.126, and has subsequently been standardized in POSIX.1-2008.
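You can reproduce this without involving the container runtimes at all. Any throwaway symlink will do, and the target doesn’t even have to exist, since ELOOP fires before the target is resolved. A quick sketch with a python3 one-liner (the paths here are made up for illustration):
$ ln -s /nfs/shared/etc/subgid /tmp/subgid-link
$ python3 -c 'import os; os.open("/tmp/subgid-link", os.O_RDONLY | os.O_NOFOLLOW)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
OSError: [Errno 40] Too many levels of symbolic links: '/tmp/subgid-link'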
Soooooooo, that means we can’t set up /etc/sub*id as symlinks to files in a shared location. Which kinda sucks. So, back to the drawing board for us.
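The obvious fallback, which I’m only sketching here since we haven’t settled on anything, is to keep real files in /etc and sync them from the shared copy instead of symlinking, via cron or whatever config management is already in place:
$ cp /nfs/shared/etc/subuid /etc/subuid
$ cp /nfs/shared/etc/subgid /etc/subgid
Same single source of truth, but no symlink for O_NOFOLLOW to choke on.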
Anyway, I did open an issue about this with Apptainer, so hopefully in future versions someone doing the same thing won’t get tripped up by a misleading error message. Even if a check isn’t added to Apptainer, at least this will (hopefully) be discoverable on a search engine.
[1] Update 2025-05-25: I later discovered that the lack of a clear error message was brought up in the shadow-utils repository as well, where someone encountered the very same problem. A patch was merged, so we should see a proper error message in a future version if someone encounters this.