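Agent Diagnostic
:latest was re-tagged to v0.0.73 on 2026-06-30
Traced the failure to drop_capability_bounding_set() in crates/openshell-supervisor-process/src/process.rs, introduced by PR #2001 (fix(supervisor): drop sandbox child capability bounding set)
Ran a diagnostic GitHub Actions workflow on ubuntu-24.04 runners testing supervisor versions 0.0.63, 0.0.72, 0.0.73, and :latest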
Confirmed apparmor_restrict_unprivileged_userns=1 is active on GHA runners — AppArmor denies CAP_SETPCAP operations inside rootless Podman user namespaces
Verified via the dmesg audit log that the unprivileged_userns profile denies capability checks (this entry is for capability 21, sys_admin): apparmor="DENIED" operation="capable" profile="unprivileged_userns" capability=21 capname="sys_admin"
Performed an A/B test: supervisor 0.0.63 and 0.0.72 create sandboxes successfully; 0.0.73 crashes during sandbox creation
Confirmed the same crash on macOS + Podman machine (Fedora 41 CoreOS aarch64)
Reviewed the comments on PR #2001 (fix(supervisor): drop sandbox child capability bounding set). A reviewer explicitly warned: "this still fails closed incorrectly for the Podman path" and "the regression test skips when CAP_SETPCAP is unavailable, so it would not catch the Podman-relevant failure mode"
Did not use the repo's .agents/skills/ skills (investigation was done from a downstream consumer's perspective using GHA diagnostics, gh CLI, and code review of the relevant crates)
The agent could not resolve this: the fix requires changes to validate_capability_bounding_set_clear() in the supervisor crate
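You can check whether a host has the restriction from the diagnostic enabled before trying the reproduction. The minimal Rust sketch below reads the Ubuntu sysctl through /proc. It assumes the standard procfs path /proc/sys/kernel/apparmor_restrict_unprivileged_userns. It treats a missing file as "knob not present", which is what you would expect on kernels without the Ubuntu AppArmor patches. The function names are hypothetical and are not part of OpenShell.

```rust
use std::fs;
use std::io;

/// Procfs path of the Ubuntu sysctl discussed in this issue (assumed standard location).
const RESTRICT_USERNS_SYSCTL: &str = "/proc/sys/kernel/apparmor_restrict_unprivileged_userns";

/// Interprets the sysctl contents: "1" means unprivileged user namespaces are
/// restricted by AppArmor, "0" means they are not. Anything else yields None.
fn parse_restrict_flag(contents: &str) -> Option<bool> {
    match contents.trim() {
        "1" => Some(true),
        "0" => Some(false),
        _ => None,
    }
}

/// Reads the sysctl on the current host. If the file does not exist
/// (e.g. non-Ubuntu kernel, or macOS), this reports Ok(None) rather than an error.
fn userns_restricted() -> io::Result<Option<bool>> {
    match fs::read_to_string(RESTRICT_USERNS_SYSCTL) {
        Ok(s) => Ok(parse_restrict_flag(&s)),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

On a GitHub Actions ubuntu-24.04 runner you would expect `userns_restricted()` to return `Ok(Some(true))`, which matches the diagnostic above.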
Description
OpenShell v0.0.73 supervisor crashes during sandbox creation in rootless Podman on hosts where AppArmor restricts unprivileged user namespaces (apparmor_restrict_unprivileged_userns=1, the default on Ubuntu 24.04).
PR #2001 added drop_capability_bounding_set(), which calls capctl::caps::bounding::clear() and therefore requires an effective CAP_SETPCAP. PR #2001 also added SETPCAP to the Podman driver's cap_add to provide this capability. However, on Ubuntu 24.04 with apparmor_restrict_unprivileged_userns=1, AppArmor transitions processes that enter a user namespace (which rootless Podman creates) into the unprivileged_userns profile. That profile denies capability checks in the kernel, so bounding::clear() returns EPERM even though Podman granted SETPCAP.
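One way to see the gap described above is to look at the capability masks the kernel reports. /proc/self/status exposes CapEff and CapBnd as hex bitmasks (see proc(5)). The std-only Rust sketch below parses those masks and names a few relevant bits, using the capability numbers from linux/capability.h (CAP_SETPCAP = 8, CAP_NET_ADMIN = 12, CAP_SYS_ADMIN = 21). The helper names are hypothetical. The key point is that a set CAP_SETPCAP bit in CapEff only shows that Podman granted the capability. It cannot tell you whether an LSM such as AppArmor will still deny the capability check when bounding::clear() runs.

```rust
use std::fs;
use std::io;

// Capability numbers from <linux/capability.h>.
const CAP_SETPCAP: u32 = 8;
const CAP_NET_ADMIN: u32 = 12;
const CAP_SYS_ADMIN: u32 = 21;

/// Extracts a hex capability mask such as "CapEff:\t0000000000201100"
/// from the text of /proc/<pid>/status.
fn cap_mask(status: &str, field: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let rest = line.strip_prefix(field)?.strip_prefix(':')?;
        u64::from_str_radix(rest.trim(), 16).ok()
    })
}

/// True if capability number `cap` is set in `mask`.
fn has_cap(mask: u64, cap: u32) -> bool {
    (mask & (1u64 << cap)) != 0
}

/// Names the capabilities of interest for this issue that are present in `mask`.
fn named_caps(mask: u64) -> Vec<&'static str> {
    [(CAP_SETPCAP, "setpcap"), (CAP_NET_ADMIN, "net_admin"), (CAP_SYS_ADMIN, "sys_admin")]
        .iter()
        .copied()
        .filter(|&(cap, _)| has_cap(mask, cap))
        .map(|(_, name)| name)
        .collect()
}

/// Reads the status file of the calling process (Linux only).
fn read_own_status() -> io::Result<String> {
    fs::read_to_string("/proc/self/status")
}
```

Inside a rootless Podman sandbox, `has_cap(CapEff, CAP_SETPCAP)` can be true while bounding::clear() still fails with EPERM. This is why checking the mask alone is not a reliable guard.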
The fallback in validate_capability_bounding_set_clear() handles:
Ok(()) + empty set → success
EPERM + empty set → tolerated (set already clear)
EPERM + non-empty set → fatal error ← this is the unhandled case
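The three cases listed above can be written down as a small decision table. The Rust sketch below is a hypothetical, simplified model and not the real function, whose signature and types this issue does not show. It represents the result of bounding::clear() as `Result<(), i32>` carrying an errno, and reduces the bounding-set state to "empty or not". The catch-all arm for unlisted combinations is an assumption.

```rust
/// Linux errno value for "Operation not permitted".
const EPERM: i32 = 1;

/// Possible outcomes of the (modelled) validation step.
#[derive(Debug, PartialEq)]
enum Outcome {
    Cleared,
    AlreadyClear,
    Fatal(&'static str),
}

/// Models the current behaviour described in the issue.
fn classify_current(clear: Result<(), i32>, bounding_empty: bool) -> Outcome {
    match (clear, bounding_empty) {
        // Ok(()) + empty set: success.
        (Ok(()), true) => Outcome::Cleared,
        // EPERM + empty set: tolerated, nothing left to drop.
        (Err(EPERM), true) => Outcome::AlreadyClear,
        // EPERM + non-empty set: the rootless Podman case, currently fatal.
        (Err(EPERM), false) => Outcome::Fatal("EPERM with a non-empty bounding set"),
        // Anything else (assumption): also treated as fatal.
        _ => Outcome::Fatal("other result"),
    }
}
```

Under rootless Podman with the AppArmor restriction, the inputs are `(Err(EPERM), false)`, which lands on the fatal arm.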
In rootless Podman, the bounding set retains SYS_ADMIN, NET_ADMIN, SETPCAP, etc. from --cap-add, so the third branch fires and the supervisor exits.
This is distinct from #2068 (:latest pinning). #2068 addresses which version gets pulled; this issue addresses a crash in v0.0.73 that must be fixed before that version can work in rootless Podman environments.
Reproduction Steps
On any Ubuntu 24.04 host or GitHub Actions ubuntu-24.04 runner:
Run a sandbox with supervisor v0.0.72 (pre-PR-2001): the sandbox is created successfully.
Run a sandbox with supervisor v0.0.73 (post-PR-2001): the supervisor crashes:
# Configure gateway.toml with:
# supervisor_image = "ghcr.io/nvidia/openshell/supervisor:0.0.73"
openshell sandbox create --from base
# → "sandbox is not ready" — supervisor exits with EPERM during drop_privileges()
The crash occurs during sandbox creation when drop_privileges() calls drop_capability_bounding_set() for a child process — not at startup or --version.
Also reproduced on macOS + Podman machine (Fedora 41 CoreOS aarch64)
Supervisor image: ghcr.io/nvidia/openshell/supervisor:0.0.73 (= :latest as of 2026-06-30T15:31Z)
Logs
# AppArmor audit from dmesg on GHA runner:
audit: type=1400 apparmor="DENIED" operation="capable" class="cap"
profile="unprivileged_userns" pid=2536 comm="unshare"
capability=21 capname="sys_admin"
# Supervisor error (from issue #2067 report, same root cause):
WARN openshell_supervisor_network::proxy: host.openshell.internal maps to a non-link-local IP; trusted-gateway SSRF exemption disabled
WARN openshell_supervisor_process::netns: Failed to delete network namespace
Error: × Invalid argument (os error 22)
Suggested Fix
validate_capability_bounding_set_clear() needs to handle EPERM + non-empty bounding set instead of exiting. Two options:
Add a fourth branch that logs a warning and continues, since the child is already constrained by seccomp + Landlock + the container's own restrictions
Probe whether CAP_SETPCAP is effective before calling bounding::clear(), and skip the call when it is not
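The first option can be sketched as an extra arm in the same kind of decision table. The Rust below is a hypothetical model. It assumes the clear result is an errno and that the remaining bounding-set mask can be read after the failed call. `eprintln!` stands in for whatever logging the supervisor actually uses. Every other unexpected result still fails closed. The probing alternative is not sketched here.

```rust
/// Linux errno value for "Operation not permitted".
const EPERM: i32 = 1;

#[derive(Debug, PartialEq)]
enum Outcome {
    Cleared,
    AlreadyClear,
    /// Proposed: clear() was denied (e.g. by AppArmor's unprivileged_userns
    /// profile) and capabilities remain; continue with a warning.
    ContinueWithWarning { remaining: u64 },
    Fatal,
}

/// `remaining` is the bounding-set mask read after calling clear().
fn classify_proposed(clear: Result<(), i32>, remaining: u64) -> Outcome {
    match clear {
        Ok(()) if remaining == 0 => Outcome::Cleared,
        Err(EPERM) if remaining == 0 => Outcome::AlreadyClear,
        Err(EPERM) => {
            eprintln!(
                "warning: capability bounding set not cleared (EPERM), remaining mask {remaining:#x}; \
                 relying on seccomp, Landlock and container restrictions"
            );
            Outcome::ContinueWithWarning { remaining }
        }
        // Ok(()) with a non-empty set, or any other errno: still fatal.
        _ => Outcome::Fatal,
    }
}
```

Limiting the relaxation to EPERM keeps the change narrow. Whether failing open in this one case is acceptable is a security call for the maintainers.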
The PR #2001 reviewer also noted: "The current regression test skips when CAP_SETPCAP is unavailable, so it would not catch the Podman-relevant failure mode." Adding a rootless Podman CI test target would prevent future regressions.
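The reviewer's point is that a test gated on CAP_SETPCAP never exercises the Podman path. One way around this, alongside a rootless Podman CI target, is to inject the clear operation so a unit test can simulate EPERM on any machine. The Rust below is a sketch of that design under assumptions. `drop_bounding_set_with` is hypothetical. `clear` stands in for capctl::caps::bounding::clear() and `read_bounding` for reading the remaining mask.

```rust
/// Linux errno value for "Operation not permitted".
const EPERM: i32 = 1;

/// Hypothetical, dependency-injected variant of the bounding-set drop.
fn drop_bounding_set_with(
    clear: impl FnOnce() -> Result<(), i32>,
    read_bounding: impl Fn() -> u64,
) -> Result<(), String> {
    match clear() {
        Ok(()) => match read_bounding() {
            0 => Ok(()),
            left => Err(format!("clear() returned Ok but {left:#x} remains")),
        },
        Err(EPERM) => {
            let left = read_bounding();
            if left != 0 {
                // Proposed fourth branch: warn and continue.
                eprintln!("warning: bounding set not cleared (EPERM), {left:#x} remains");
            }
            Ok(())
        }
        Err(errno) => Err(format!("clear() failed with errno {errno}")),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    /// Needs no CAP_SETPCAP, so it runs (and is never skipped) everywhere.
    #[test]
    fn eperm_with_nonempty_bounding_set_is_tolerated() {
        // SETPCAP (8), NET_ADMIN (12) and SYS_ADMIN (21) left behind,
        // as in the rootless Podman case.
        let left = (1u64 << 8) | (1u64 << 12) | (1u64 << 21);
        assert!(drop_bounding_set_with(|| Err(EPERM), || left).is_ok());
    }

    #[test]
    fn other_errors_still_fail_closed() {
        assert!(drop_bounding_set_with(|| Err(22), || 0).is_err());
    }
}
```

A test like this covers the decision logic only. It does not replace a CI job that actually runs under rootless Podman with the AppArmor restriction enabled.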
Workaround
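Pin the supervisor image in gateway.toml to a pre-v0.0.73 version, for example 0.0.72, which the A/B test showed working:
supervisor_image = "ghcr.io/nvidia/openshell/supervisor:0.0.72"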
Environment
GitHub Actions ubuntu-24.04 runner (image 20260622.220.1, kernel 6.17.0-1018-azure)
apparmor_restrict_unprivileged_userns = 1 (Ubuntu 24.04 default)
Related
#2068: :latest pinning (complementary fix, addresses the mutable tag problem)
ip netns add fails on hardened/immutable hosts (same class of rootless restrictions)
Agent-First Checklist
debug-openshell-cluster, debug-inference, openshell-cli)