16 changes: 16 additions & 0 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -285,6 +285,18 @@ workload entrypoint PID to `OPENSHELL_ENTRYPOINT_PID_FILE`
should read it for binary-scoped policy decisions; if allowed network rules are
all denied, inspect that file and the network sidecar logs.

If `supervisor_topology = "cni-sidecar"` is rendered, the gateway should render
the same process container and long-running network sidecar as sidecar mode, but
there should be no `openshell-network-init` init container in sandbox pods.
Instead, the chart must install the privileged `openshell-cni` DaemonSet and the
sandbox pod should carry `openshell.ai/cni=enabled`,
`openshell.ai/network-enforcement-mode=cni-sidecar`, and
`openshell.ai/proxy-uid=<uid>` annotations. The CNI DaemonSet copies
`/openshell-cni` into the host CNI binary directory and patches an existing CNI
`.conflist`; if sandbox pods bypass network enforcement or fail during pod
network setup, inspect the DaemonSet logs, the host CNI config, and whether the
cluster actually invokes chained CNI plugins for the sandbox runtime class.
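The annotation check described above can be sketched as a small shell helper. This is a minimal sketch, not part of the chart or the skill: the function name `check_cni_annotations` is hypothetical, and it assumes the annotations arrive on stdin as compact JSON (the shape `kubectl ... -o jsonpath='{.metadata.annotations}'` typically prints). The annotation keys and values are the ones named in the paragraph above.

```shell
# Hypothetical helper: reads a pod's annotations as compact JSON on stdin
# and reports which of the expected cni-sidecar annotations are missing.
# Intended use (placeholders as in the commands in this skill):
#   kubectl -n <sandbox-namespace> get pod <sandbox-pod> \
#     -o jsonpath='{.metadata.annotations}' | check_cni_annotations
check_cni_annotations() {
  annotations=$(cat)
  missing=0
  # These two annotations have fixed values in cni-sidecar mode.
  for expected in \
    '"openshell.ai/cni":"enabled"' \
    '"openshell.ai/network-enforcement-mode":"cni-sidecar"'; do
    if ! printf '%s' "$annotations" | grep -qF -- "$expected"; then
      echo "missing: $expected"
      missing=1
    fi
  done
  # The proxy UID value varies per deployment, so only require the key.
  if ! printf '%s' "$annotations" | grep -qF -- '"openshell.ai/proxy-uid":'; then
    echo "missing: openshell.ai/proxy-uid"
    missing=1
  fi
  return "$missing"
}
```

A non-zero exit status means at least one annotation is absent, which suggests the gateway did not render the pod for the `cni-sidecar` topology. If the JSON is pretty-printed rather than compact, the fixed-string matches will not apply and a JSON-aware tool would be a better fit.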

If `supervisor_topology = "proxy-pod"` is rendered, each sandbox should have a
separate supervisor Deployment with one supervisor pod, a headless supervisor
Service, a proxy CA Secret, and two per-sandbox NetworkPolicies. The agent pod
@@ -305,6 +317,9 @@ Inspect all three when sandbox registration or egress enforcement fails:
kubectl -n openshell get configmap openshell-config -o jsonpath='{.data.gateway\.toml}' | grep supervisor_topology
kubectl -n <sandbox-namespace> get pod <sandbox-pod> -o jsonpath='{range .spec.initContainers[*]}{.name}{" "}{.command}{"\n"}{end}'
kubectl -n <sandbox-namespace> get pod <sandbox-pod> -o jsonpath='{range .spec.containers[*]}{.name}{" "}{.command}{"\n"}{end}'
kubectl -n <sandbox-namespace> get pod <sandbox-pod> -o jsonpath='{.metadata.annotations}'
kubectl -n openshell get daemonset,pod -l app.kubernetes.io/component=cni
kubectl -n openshell logs daemonset/openshell-cni -c install-cni --tail=200
kubectl -n <sandbox-namespace> logs <sandbox-pod> -c openshell-network-init --tail=200
kubectl -n <sandbox-namespace> logs <sandbox-pod> -c openshell-supervisor-network --tail=200
kubectl -n <sandbox-namespace> logs <sandbox-pod> -c agent --tail=200
@@ -338,6 +353,7 @@ openshell logs <sandbox-name>
| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs deployment/openshell -c openshell-gateway` or `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` |
| CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
| Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials |
| CNI-sidecar sandbox pods fail network setup | OpenShell CNI DaemonSet did not patch the node CNI conflist, cannot read pods, or the runtime class does not invoke the chained plugin | `kubectl -n openshell logs daemonset/openshell-cni -c install-cni`, chart `cni.*` values, host CNI config |
| `K8s namespace not ready` with `envoy-gateway-openshell.yaml: the server could not find the requested resource` | Optional Gateway API manifest was applied without Envoy Gateway CRDs, or k3s Helm controller startup exceeded the namespace wait | Apply `deploy/kube/manifests/envoy-gateway-openshell.yaml` manually only after Envoy Gateway is installed and `grpcRoute` is enabled |

## Reporting
22 changes: 21 additions & 1 deletion .agents/skills/helm-dev-environment/SKILL.md
@@ -65,6 +65,11 @@ mise run helm:skaffold:run
mise run helm:skaffold:run:sidecar
```

**Supervisor CNI-sidecar topology** (build once and leave running):
```bash
mise run helm:skaffold:run:cni-sidecar
```

**Supervisor proxy-pod topology** (build once and leave running):
```bash
mise run helm:skaffold:run:proxy-pod
@@ -73,7 +78,9 @@ mise run helm:skaffold:run:proxy-pod
All Skaffold commands build the `gateway` and `supervisor` images and deploy the OpenShell Helm
chart. The sidecar profile renders an `openshell-network-init` init container for
nftables setup and a non-root `openshell-supervisor-network` runtime sidecar for
-proxying. The proxy-pod profile renders network supervision in a separate
+proxying. The cni-sidecar profile enables the privileged OpenShell CNI
+DaemonSet and uses the sidecar runtime model without the pod-local network init
+container. The proxy-pod profile renders network supervision in a separate
supervisor Deployment with one pod and relies on Kubernetes NetworkPolicy
enforcement so the agent pod can reach only its paired supervisor plus DNS. The
default local k3s/k3d cluster keeps k3s's embedded NetworkPolicy controller
@@ -104,6 +111,12 @@ Run the sidecar topology e2e environment:
mise run e2e:kubernetes:sidecar
```

Run the CNI-sidecar topology e2e environment:

```bash
mise run e2e:kubernetes:cni-sidecar
```

Run the proxy-pod topology e2e environment:

```bash
@@ -176,6 +189,12 @@ For a sidecar-profile deployment:
mise run helm:skaffold:delete:sidecar
```

For a cni-sidecar-profile deployment:

```bash
mise run helm:skaffold:delete:cni-sidecar
```

For a proxy-pod-profile deployment:

```bash
@@ -307,6 +326,7 @@ for dependencies still declared in `Chart.yaml`.
| `deploy/helm/openshell/ci/values-high-availability.yaml` | HA test overlay (`replicaCount: 2` with external PostgreSQL Secret) |
| `deploy/helm/openshell/ci/values-keycloak.yaml` | Keycloak OIDC overlay |
| `deploy/helm/openshell/ci/values-sidecar.yaml` | Supervisor sidecar topology overlay for Kubernetes e2e/dev |
| `deploy/helm/openshell/ci/values-cni-sidecar.yaml` | Supervisor CNI-sidecar topology overlay for Kubernetes e2e/dev; enables the OpenShell CNI DaemonSet |
| `deploy/helm/openshell/ci/values-proxy-pod.yaml` | Supervisor proxy-pod topology overlay for Kubernetes e2e/dev; requires NetworkPolicy enforcement |
| `deploy/helm/openshell/ci/values-spire.yaml` | SPIFFE/SPIRE provider token grant overlay |
| `deploy/helm/openshell/ci/values-spire-stack.yaml` | SPIRE hardened chart values for local dev |
3 changes: 3 additions & 0 deletions .github/workflows/docker-build.yml
@@ -246,6 +246,9 @@ jobs:
fi
mkdir -p "$stage"
install -m 0755 "$found" "$stage/$binary"
if [[ "${{ inputs.component }}" == "supervisor" ]]; then
PREBUILT_ARCH="${{ matrix.arch }}" tasks/scripts/stage-prebuilt-binaries.sh cni
fi
ls -lh "$stage/"

- name: Build ${{ inputs.component }} image
14 changes: 14 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

18 changes: 11 additions & 7 deletions architecture/compute-runtimes.md
@@ -89,19 +89,23 @@ Driver-controlled environment variables must override sandbox image or template
values for sandbox ID, sandbox name, gateway endpoint, relay socket path, TLS
paths, and command metadata.

-Kubernetes can run the supervisor in the default combined topology or in a
-sidecar topology. Combined mode keeps network and process supervision in the
+Kubernetes can run the supervisor in combined, sidecar, cni-sidecar, or
+proxy-pod topology. Combined mode keeps network and process supervision in the
agent container. Sidecar mode runs network enforcement, the proxy, and gateway
loopback forwarding in a dedicated sidecar, while the agent container runs only
the process-supervision leaf and launches the user workload after the sidecar
signals readiness. In sidecar mode, an init container performs the privileged
pod-network nftables setup with `NET_ADMIN` and hands shared state ownership to
the configured proxy UID; the long-running network sidecar runs as that UID and
-does not keep `NET_ADMIN`. The agent container runs as the resolved sandbox
-UID/GID with no added Linux capabilities. Sidecar mode preserves gateway session
-and SSH behavior, but treats the process leaf as network-only: Landlock
-filesystem policy, process privilege dropping, and process/binary identity
-checks are not applied there.
+does not keep `NET_ADMIN`. CNI-sidecar mode keeps the sidecar runtime model but
+requires the privileged OpenShell CNI DaemonSet to install the pod-network rules
+during CNI `ADD` using nftables or iptables. Proxy-pod mode moves network
+enforcement into a paired supervisor Deployment and requires NetworkPolicy
+enforcement. The agent container runs as the resolved sandbox UID/GID with no
+added Linux capabilities in the alternate topologies. They preserve gateway
+session and SSH behavior, but
+treat the process leaf as network-only: Landlock filesystem policy, process
+privilege dropping, and process/binary identity checks are not applied there.
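The four topologies in the paragraph above can be summarized as a lookup, sketched here in shell. This is only an illustrative summary under the assumption that the descriptions above are complete: the function name `describe_topology` is hypothetical and does not exist in the repository, and the output strings paraphrase the text rather than reflect any real command output.

```shell
# Hypothetical summary of where network enforcement lives in each
# supervisor topology, paraphrased from the architecture text above.
describe_topology() {
  case "$1" in
    combined)
      echo "agent container: network and process supervision" ;;
    sidecar)
      echo "init container: privileged nftables setup with NET_ADMIN"
      echo "network sidecar: runs as the proxy UID without NET_ADMIN" ;;
    cni-sidecar)
      echo "OpenShell CNI DaemonSet: installs pod-network rules during CNI ADD"
      echo "network sidecar: same runtime model as sidecar mode" ;;
    proxy-pod)
      echo "supervisor Deployment: paired network enforcement"
      echo "NetworkPolicy enforcement: required" ;;
    *)
      # Any other value is treated as an error in this sketch.
      echo "unknown topology: $1" >&2
      return 1 ;;
  esac
}
```

For example, `describe_topology cni-sidecar` prints the two components to inspect first when a cni-sidecar sandbox fails network setup.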

## Images

29 changes: 29 additions & 0 deletions crates/openshell-cni/Cargo.toml
@@ -0,0 +1,29 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

[package]
name = "openshell-cni"
description = "OpenShell chained CNI plugin for Kubernetes sidecar network enforcement"
version.workspace = true
edition.workspace = true
license.workspace = true
repository.workspace = true
rust-version.workspace = true

[dependencies]
base64 = { workspace = true }
miette = { workspace = true }
reqwest = { workspace = true, features = ["blocking"] }
serde = { workspace = true }
serde_json = { workspace = true }
serde_yml = { workspace = true }
tempfile = "3"

[target.'cfg(target_os = "linux")'.dependencies]
libc = "0.2"

[dev-dependencies]
tempfile = "3"

[lints]
workspace = true