19 changes: 17 additions & 2 deletions .agents/skills/debug-openshell-cluster/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -274,8 +274,8 @@ If `supervisor_topology = "sidecar"` is rendered, sandbox pods should have an
`openshell-supervisor-network` container running `--mode=network`. The init
container owns nftables setup and should be the only sidecar topology container
with `NET_ADMIN`. It also needs `CHOWN`/`FOWNER` to hand shared emptyDir state
to `sidecar_proxy_uid`. The long-running network sidecar runs as
`sidecar_proxy_uid` with primary GID `0` so it can read the root-owned,
to `proxy_uid`. The long-running network sidecar runs as
`proxy_uid` with primary GID `0` so it can read the root-owned,
group-readable projected service-account token. In sidecar topology the
`openshell-sa-token` projected volume should render `defaultMode: 288` (`0440`);
if the proxy logs `failed to read K8s SA token`, verify this token mode and the
Expand All @@ -284,6 +284,21 @@ workload entrypoint PID to `OPENSHELL_ENTRYPOINT_PID_FILE`
(`/run/openshell-sidecar/entrypoint.pid` by default), and the network sidecar
should read it for binary-scoped policy decisions; if allowed network rules are
all denied, inspect that file and the network sidecar logs.
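
The `defaultMode: 288` value above is the decimal form of octal `0440`, which is easy to misread when comparing rendered manifests against `stat` output. A minimal sketch of that conversion follows; the helper names are hypothetical and not part of OpenShell.

```rust
/// Render a Kubernetes `defaultMode` integer as a four-digit octal string,
/// the form shown by `stat` and used in most permission discussions.
/// Hypothetical helper, not an OpenShell API.
pub fn mode_as_octal(default_mode: u32) -> String {
    format!("{:04o}", default_mode)
}

/// True when the mode grants read to owner and group and nothing else,
/// which is what the sidecar topology expects for the projected token.
pub fn is_owner_group_read_only(default_mode: u32) -> bool {
    default_mode == 0o440
}
```

If a rendered pod spec shows a different decimal value, converting it this way makes it clear whether the group-read bit the sidecar relies on is missing.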

If `supervisor_topology = "proxy-pod"` is rendered, each sandbox should have a
separate supervisor Deployment with one supervisor pod, a headless supervisor
Service, a proxy CA Secret, and two per-sandbox NetworkPolicies. The agent pod
should have `openshell.ai/sandbox-role=agent`; the supervisor pod should have
`openshell.ai/sandbox-role=supervisor`; both should share the same
`openshell.ai/sandbox-id`. The supervisor Deployment must have a controlling
`Sandbox` ownerReference. The Deployment pod template must carry the
`openshell.io/sandbox-id` annotation so the TokenReview bootstrap path can mint
a sandbox JWT. For supervisor pods, the gateway validates the
`Pod -> ReplicaSet -> Deployment -> Sandbox` owner chain, so missing
`apps/replicasets get` RBAC can also break bootstrap. If the agent cannot reach
the gateway, check DNS to the headless Service, the agent egress NetworkPolicy
DNS exception for kube-dns/CoreDNS, and the supervisor ingress NetworkPolicy
allowing only that agent pod on ports `3128` and `18080`.
Inspect all three when sandbox registration or egress enforcement fails:

```bash
Expand Down
47 changes: 45 additions & 2 deletions .agents/skills/helm-dev-environment/SKILL.md
Expand Up @@ -65,10 +65,21 @@ mise run helm:skaffold:run
mise run helm:skaffold:run:sidecar
```

Both commands build the `gateway` and `supervisor` images and deploy the OpenShell Helm
**Supervisor proxy-pod topology** (build once and leave running):
```bash
mise run helm:skaffold:run:proxy-pod
```

All Skaffold commands build the `gateway` and `supervisor` images and deploy the OpenShell Helm
chart. The sidecar profile renders an `openshell-network-init` init container for
nftables setup and a non-root `openshell-supervisor-network` runtime sidecar for
proxying. The `pkiInitJob` hook (a pre-install Job that runs `openshell-gateway
proxying. The proxy-pod profile renders network supervision in a separate
supervisor Deployment with one pod and relies on Kubernetes NetworkPolicy
enforcement so the agent pod can reach only its paired supervisor plus DNS. The
default local k3s/k3d cluster keeps k3s's embedded NetworkPolicy controller
enabled; if you replace the CNI, install a policy-enforcing CNI before using
proxy-pod. The
`pkiInitJob` hook (a pre-install Job that runs `openshell-gateway
generate-certs`) generates mTLS secrets on first install. Envoy Gateway is
opt-in; see the Optional Add-ons section below.

Expand All @@ -79,6 +90,31 @@ The gateway Service uses ClusterIP. Access is via Envoy Gateway (port `8080`) or
create the Secret named `openshell-ha-pg` with a `uri` key, then run
`mise run helm:skaffold:run` or `mise run helm:skaffold:dev`.

### Kubernetes e2e profiles

Run the default Kubernetes e2e environment:

```bash
mise run e2e:kubernetes
```

Run the sidecar topology e2e environment:

```bash
mise run e2e:kubernetes:sidecar
```

Run the proxy-pod topology e2e environment:

```bash
mise run e2e:kubernetes:proxy-pod
```

The proxy-pod e2e task applies `ci/values-proxy-pod.yaml` through
`OPENSHELL_E2E_KUBE_EXTRA_VALUES`. Use an existing cluster with NetworkPolicy
enforcement, or let the wrapper create the default local k3d/k3s cluster with
k3s's embedded NetworkPolicy controller enabled.

### TLS behaviour

`ci/values-skaffold.yaml` sets `server.disableTls: true`, so Skaffold-based deploys run
Expand Down Expand Up @@ -140,6 +176,12 @@ For a sidecar-profile deployment:
mise run helm:skaffold:delete:sidecar
```

For a proxy-pod-profile deployment:

```bash
mise run helm:skaffold:delete:proxy-pod
```

### Delete the cluster entirely

```bash
Expand Down Expand Up @@ -265,6 +307,7 @@ for dependencies still declared in `Chart.yaml`.
| `deploy/helm/openshell/ci/values-high-availability.yaml` | HA test overlay (`replicaCount: 2` with external PostgreSQL Secret) |
| `deploy/helm/openshell/ci/values-keycloak.yaml` | Keycloak OIDC overlay |
| `deploy/helm/openshell/ci/values-sidecar.yaml` | Supervisor sidecar topology overlay for Kubernetes e2e/dev |
| `deploy/helm/openshell/ci/values-proxy-pod.yaml` | Supervisor proxy-pod topology overlay for Kubernetes e2e/dev; requires NetworkPolicy enforcement |
| `deploy/helm/openshell/ci/values-spire.yaml` | SPIFFE/SPIRE provider token grant overlay |
| `deploy/helm/openshell/ci/values-spire-stack.yaml` | SPIRE hardened chart values for local dev |
| `deploy/helm/openshell/ci/values-tls-disabled.yaml` | Lint-only: TLS + auth disabled (reverse-proxy edge termination) |
Expand Down
4 changes: 4 additions & 0 deletions Cargo.lock


2 changes: 1 addition & 1 deletion Cargo.toml
Expand Up @@ -37,7 +37,7 @@ http-body-util = "0.1"
tokio-rustls = { version = "0.26", default-features = false, features = ["logging", "tls12", "ring"] }
rustls = { version = "0.23", default-features = false, features = ["std", "logging", "tls12", "ring"] }
rustls-pemfile = "2"
rcgen = { version = "0.13", features = ["crypto", "pem"] }
rcgen = { version = "0.13", features = ["crypto", "pem", "x509-parser"] }
webpki-roots = "1"

# CLI
Expand Down
8 changes: 5 additions & 3 deletions architecture/gateway.md
Expand Up @@ -64,9 +64,11 @@ Podman, and VM drivers deliver the initial token through supervisor-only
runtime material; Kubernetes supervisors exchange a projected ServiceAccount
token through `IssueSandboxToken`. The gateway validates that projected token
with Kubernetes `TokenReview`, requires the configured sandbox service account,
checks the returned pod binding against the live pod UID, and verifies the pod's
controlling `Sandbox` ownerReference against the live Sandbox CR UID and
sandbox-id label before minting the gateway JWT. The bootstrap path accepts
checks the returned pod binding against the live pod UID, and verifies the
pod's ownership against the live Sandbox CR UID and sandbox-id label before
minting the gateway JWT. Agent pods must be directly controlled by the
`Sandbox` CR. Proxy-pod supervisor pods may be controlled through the Kubernetes
`Pod -> ReplicaSet -> Deployment -> Sandbox` chain. The bootstrap path accepts
both `agents.x-k8s.io/v1beta1` ownerReferences from newer Agent Sandbox
controllers and `agents.x-k8s.io/v1alpha1` ownerReferences from existing
deployments. Supervisors renew gateway JWTs in memory before expiry only while
Expand Down
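
The ownership rule in this hunk (agent pods directly controlled by the `Sandbox`, proxy-pod supervisor pods allowed through `Pod -> ReplicaSet -> Deployment -> Sandbox`) can be sketched as a walk over controller references. This is an illustration only: `Kind`, `Object`, and `resolve_sandbox_owner` are hypothetical names, the objects live in an in-memory map rather than the Kubernetes API, and the UID and sandbox-id label comparisons the gateway also performs are left out.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Pod,
    ReplicaSet,
    Deployment,
    Sandbox,
}

/// Hypothetical stand-in for a Kubernetes object and its controlling owner.
#[derive(Debug, Clone)]
pub struct Object {
    pub kind: Kind,
    pub uid: String,
    /// UID of the controlling ownerReference, if any.
    pub controller: Option<String>,
}

/// Look up the controlling owner of `obj` in the in-memory object map.
fn controller_of<'a>(
    objects: &'a HashMap<String, Object>,
    obj: &Object,
) -> Result<&'a Object, String> {
    let uid = obj
        .controller
        .as_deref()
        .ok_or_else(|| format!("{:?} '{}' has no controlling owner", obj.kind, obj.uid))?;
    objects.get(uid).ok_or_else(|| format!("owner '{uid}' not found"))
}

/// Return the UID of the Sandbox that controls the pod.
/// With `allow_deployment_chain = false` only a direct Sandbox owner is
/// accepted; with `true` the ReplicaSet -> Deployment hop is also accepted.
pub fn resolve_sandbox_owner(
    objects: &HashMap<String, Object>,
    pod_uid: &str,
    allow_deployment_chain: bool,
) -> Result<String, String> {
    let pod = objects
        .get(pod_uid)
        .filter(|o| o.kind == Kind::Pod)
        .ok_or_else(|| format!("pod '{pod_uid}' not found"))?;
    let mut current = controller_of(objects, pod)?;
    if current.kind == Kind::Sandbox {
        return Ok(current.uid.clone());
    }
    if !allow_deployment_chain {
        return Err("pod is not directly controlled by a Sandbox".to_string());
    }
    // Walk the remaining hops in the only order the chain permits.
    for expected in [Kind::ReplicaSet, Kind::Deployment] {
        if current.kind != expected {
            return Err(format!("expected {expected:?}, found {:?}", current.kind));
        }
        current = controller_of(objects, current)?;
    }
    if current.kind == Kind::Sandbox {
        Ok(current.uid.clone())
    } else {
        Err(format!("chain ends at {:?}, not a Sandbox", current.kind))
    }
}
```

In this model a missing ReplicaSet lookup surfaces as an error on the second hop, which corresponds to the bootstrap failure the debugging skill attributes to missing `apps/replicasets get` RBAC.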
20 changes: 20 additions & 0 deletions crates/openshell-core/src/sandbox_env.rs
Expand Up @@ -55,6 +55,10 @@ pub const NETWORK_BINARY_IDENTITY: &str = "OPENSHELL_NETWORK_BINARY_IDENTITY";
/// File written by the network supervisor when sidecar networking is ready.
pub const SUPERVISOR_READY_FILE: &str = "OPENSHELL_SUPERVISOR_READY_FILE";

/// TCP address the process supervisor waits for before starting when the
/// network supervisor runs outside the agent process.
pub const SUPERVISOR_READY_ADDR: &str = "OPENSHELL_SUPERVISOR_READY_ADDR";
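
Waiting on a TCP address before starting, as the doc comment for `SUPERVISOR_READY_ADDR` describes, might look like the sketch below. It assumes the address has already been resolved to a `SocketAddr` and that readiness means "a TCP connect succeeds"; the function name, the retry interval, and the deadline are hypothetical and not taken from the OpenShell source.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Poll `addr` until a TCP connection succeeds or `deadline` elapses.
/// Hypothetical helper: the real supervisor may use a different readiness
/// signal, back-off, or an async runtime.
pub fn wait_for_ready(
    addr: SocketAddr,
    deadline: Duration,
    retry_every: Duration,
) -> io::Result<()> {
    let started = Instant::now();
    // `connect_timeout` rejects a zero timeout, so clamp to at least 1 ms.
    let connect_timeout = retry_every.max(Duration::from_millis(1));
    loop {
        match TcpStream::connect_timeout(&addr, connect_timeout) {
            Ok(_) => return Ok(()),
            // Give up with the last connect error once the deadline has passed.
            Err(err) if started.elapsed() >= deadline => return Err(err),
            Err(_) => thread::sleep(retry_every),
        }
    }
}
```

A caller would typically read the environment variable, resolve it, and pass the result in; DNS resolution is left out here because a headless Service name may need to be re-resolved on each attempt.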

/// File written by the process supervisor with the workload entrypoint PID and
/// read by the network sidecar for process/binary-bound network policy checks.
pub const ENTRYPOINT_PID_FILE: &str = "OPENSHELL_ENTRYPOINT_PID_FILE";
Expand All @@ -66,10 +70,26 @@ pub const GATEWAY_FORWARD_ADDR: &str = "OPENSHELL_GATEWAY_FORWARD_ADDR";
/// gateway through a loopback TCP forward.
pub const GATEWAY_TLS_SERVER_NAME: &str = "OPENSHELL_GATEWAY_TLS_SERVER_NAME";

/// Explicit URL injected into sandbox child processes for proxy-mode egress.
///
/// Kubernetes proxy-pod topology uses a headless Service DNS name, which
/// cannot be represented by the policy's `SocketAddr` proxy field.
pub const PROXY_URL: &str = "OPENSHELL_PROXY_URL";

/// Explicit listener address for the network supervisor's HTTP CONNECT proxy.
pub const PROXY_BIND_ADDR: &str = "OPENSHELL_PROXY_BIND_ADDR";

/// Directory where the network supervisor writes the proxy CA files consumed
/// by workload child processes.
pub const PROXY_TLS_DIR: &str = "OPENSHELL_PROXY_TLS_DIR";

/// Optional CA certificate PEM path used by the network supervisor instead of
/// generating an ephemeral CA.
pub const PROXY_CA_CERT_PATH: &str = "OPENSHELL_PROXY_CA_CERT_PATH";

/// Optional CA private key PEM path paired with [`PROXY_CA_CERT_PATH`].
pub const PROXY_CA_KEY_PATH: &str = "OPENSHELL_PROXY_CA_KEY_PATH";
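
The two optional CA path variables form a pair, so a consumer has three cases to handle: neither set, both set, or only one set. The sketch below shows one way to model that. `ProxyCaSource` and both functions are hypothetical, and rejecting a half-configured pair is an assumption; the source only states that the paths replace the ephemeral CA when provided.

```rust
use std::path::PathBuf;

/// Where the network supervisor gets its proxy CA from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ProxyCaSource {
    /// No paths configured: generate an ephemeral CA.
    Ephemeral,
    /// Both paths configured: load the CA from PEM files.
    Files { cert: PathBuf, key: PathBuf },
}

/// Decide the CA source from the two optional path values.
/// Empty strings are treated as unset (an assumption of this sketch).
pub fn proxy_ca_source(cert: Option<&str>, key: Option<&str>) -> Result<ProxyCaSource, String> {
    match (cert.filter(|v| !v.is_empty()), key.filter(|v| !v.is_empty())) {
        (None, None) => Ok(ProxyCaSource::Ephemeral),
        (Some(cert), Some(key)) => Ok(ProxyCaSource::Files {
            cert: cert.into(),
            key: key.into(),
        }),
        (Some(_), None) => Err(
            "OPENSHELL_PROXY_CA_CERT_PATH is set without OPENSHELL_PROXY_CA_KEY_PATH".to_string(),
        ),
        (None, Some(_)) => Err(
            "OPENSHELL_PROXY_CA_KEY_PATH is set without OPENSHELL_PROXY_CA_CERT_PATH".to_string(),
        ),
    }
}

/// Read both variables from the process environment and classify them.
pub fn proxy_ca_source_from_env() -> Result<ProxyCaSource, String> {
    let cert = std::env::var("OPENSHELL_PROXY_CA_CERT_PATH").ok();
    let key = std::env::var("OPENSHELL_PROXY_CA_KEY_PATH").ok();
    proxy_ca_source(cert.as_deref(), key.as_deref())
}
```

Keeping the decision in a pure function separate from the environment read makes the pairing rule testable without mutating process state.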

/// Path to the CA certificate for mTLS communication with the gateway.
pub const TLS_CA: &str = "OPENSHELL_TLS_CA";

Expand Down
1 change: 1 addition & 0 deletions crates/openshell-driver-kubernetes/Cargo.toml
Expand Up @@ -34,6 +34,7 @@ tracing = { workspace = true }
tracing-subscriber = { workspace = true }
thiserror = { workspace = true }
miette = { workspace = true }
rcgen = { workspace = true }

[dev-dependencies]
temp-env = "0.3"
Expand Down
14 changes: 12 additions & 2 deletions crates/openshell-driver-kubernetes/README.md
Expand Up @@ -65,8 +65,18 @@ In this mode OpenShell preserves gateway session and SSH behavior, but the
process supervisor defaults to network-only mode and does not apply Landlock
filesystem policy, process privilege dropping, or process/binary identity
checks. Network endpoint and L7 policy remain enforced by the network sidecar.
Set `process_enforcement = "full"` only when you want combined-mode
process/filesystem guards and accept the added agent-container permissions.

The `proxy-pod` supervisor topology runs network enforcement and gateway
forwarding in a separate supervisor Deployment with one pod. The agent pod runs
only the process-mode supervisor and reaches the supervisor through a
per-sandbox headless Service. The driver creates an owner-referenced supervisor
Deployment with one replica plus Service, proxy CA Secret, and NetworkPolicy
resources so agent egress is limited to its paired supervisor pod plus DNS. If
the supervisor pod is deleted, the Deployment recreates it.
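
The intent of the supervisor ingress NetworkPolicy can be modelled as a small predicate: only the agent pod of the same sandbox may connect, and only on the proxy and forwarding ports. This is a sketch of the policy's meaning, not the driver's code. The role and sandbox-id fields correspond to the `openshell.ai/sandbox-role` and `openshell.ai/sandbox-id` labels, and the ports `3128` and `18080` come from the debugging skill in this change; the type and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    Agent,
    Supervisor,
}

/// Hypothetical view of the two labels the per-sandbox policies select on.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PodLabels {
    pub sandbox_id: String,
    pub role: Role,
}

/// Ports the supervisor ingress policy is described as allowing.
pub const SUPERVISOR_PORTS: [u16; 2] = [3128, 18080];

/// True when `source` may connect to `supervisor` on `port` under the
/// described policy: same sandbox, agent-to-supervisor, allowed port.
pub fn supervisor_ingress_allows(source: &PodLabels, supervisor: &PodLabels, port: u16) -> bool {
    supervisor.role == Role::Supervisor
        && source.role == Role::Agent
        && source.sandbox_id == supervisor.sandbox_id
        && SUPERVISOR_PORTS.contains(&port)
}
```

An agent from a different sandbox, or the right agent on any other port, fails the predicate, which matches the isolation the paragraph above describes.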

Set `process_enforcement = "full"` in sidecar or proxy-pod topology only when
you want combined-mode process/filesystem guards and accept the added
agent-container permissions.

Sidecar mode uses the pod `fsGroup` to make the projected service-account token
and sandbox client TLS secret group-readable so the non-root process supervisor
Expand Down
36 changes: 26 additions & 10 deletions crates/openshell-driver-kubernetes/src/config.rs
Expand Up @@ -15,7 +15,7 @@ pub const DEFAULT_SANDBOX_SERVICE_ACCOUNT_NAME: &str = "default";
/// Default storage size for the workspace PVC.
pub const DEFAULT_WORKSPACE_STORAGE_SIZE: &str = "2Gi";

/// Default UID for the long-running Kubernetes network supervisor sidecar.
/// Default UID for the long-running Kubernetes network proxy.
pub const DEFAULT_PROXY_UID: u32 = 1337;

/// How the supervisor binary is delivered into sandbox pods.
Expand Down Expand Up @@ -65,13 +65,17 @@ pub enum SupervisorTopology {
    /// Run network supervision in a privileged sidecar and process supervision
    /// as a low-capability wrapper in the agent container.
    Sidecar,
    /// Run network supervision in a separate supervisor pod and process
    /// supervision as a low-capability wrapper in the agent pod.
    ProxyPod,
}

impl std::fmt::Display for SupervisorTopology {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::Combined => f.write_str("combined"),
            Self::Sidecar => f.write_str("sidecar"),
            Self::ProxyPod => f.write_str("proxy-pod"),
        }
    }
}
Expand All @@ -83,22 +87,23 @@ impl FromStr for SupervisorTopology {
        match s {
            "combined" => Ok(Self::Combined),
            "sidecar" => Ok(Self::Sidecar),
            "proxy-pod" => Ok(Self::ProxyPod),
            other => Err(format!("unknown supervisor topology '{other}'")),
        }
    }
}
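
Because the `Display` and `FromStr` impls are maintained by hand, a new variant such as `ProxyPod` must be added to both or the string form stops round-tripping. The standalone sketch below mirrors the impls in this diff and adds a round-trip check over every variant. The `ALL` constant and `round_trips` helper are hypothetical additions, and `type Err = String` is inferred from the `Err(format!(...))` arm rather than shown in the hunk.

```rust
use std::fmt;
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SupervisorTopology {
    Combined,
    Sidecar,
    ProxyPod,
}

impl SupervisorTopology {
    /// Every variant, for exhaustive checks (hypothetical helper constant).
    pub const ALL: [Self; 3] = [Self::Combined, Self::Sidecar, Self::ProxyPod];
}

impl fmt::Display for SupervisorTopology {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Combined => f.write_str("combined"),
            Self::Sidecar => f.write_str("sidecar"),
            Self::ProxyPod => f.write_str("proxy-pod"),
        }
    }
}

impl FromStr for SupervisorTopology {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "combined" => Ok(Self::Combined),
            "sidecar" => Ok(Self::Sidecar),
            "proxy-pod" => Ok(Self::ProxyPod),
            other => Err(format!("unknown supervisor topology '{other}'")),
        }
    }
}

/// True when formatting and re-parsing yields the same variant.
pub fn round_trips(topology: SupervisorTopology) -> bool {
    topology.to_string().parse::<SupervisorTopology>() == Ok(topology)
}
```

Iterating `SupervisorTopology::ALL` through `round_trips` in a unit test would catch a variant that was added to one impl but not the other.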

/// Process/filesystem controls applied by the process supervisor in split
/// Kubernetes topologies.
/// Process/filesystem controls applied by the process supervisor in
/// non-combined Kubernetes topologies.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default, Serialize, Deserialize)]
#[serde(rename_all = "kebab-case")]
pub enum ProcessEnforcementMode {
    /// Preserve process launch and session relay behavior, but leave
    /// filesystem/process guards to the network supervisor topology.
    /// Preserve process launch and session relay behavior while network
    /// enforcement is handled by the sidecar or proxy pod.
    #[default]
    NetworkOnly,
    /// Run the process supervisor with the same process/filesystem controls as
    /// combined topology.
    /// Run the process supervisor with combined-mode process/filesystem
    /// controls.
    Full,
}

Expand Down Expand Up @@ -255,9 +260,10 @@ pub struct KubernetesComputeConfig {
    /// non-combined topologies. `network-only` keeps the low-permission agent
    /// shape; `full` grants the agent supervisor combined-mode controls.
    pub process_enforcement: ProcessEnforcementMode,
    /// UID used by the long-running network sidecar in `sidecar` topology.
    /// The network init container installs nftables rules that exempt this
    /// UID, so it must not match the sandbox workload UID.
    /// UID used by the long-running network proxy in sidecar and proxy-pod
    /// topologies. In sidecar topology, the network init container installs
    /// nftables rules that exempt this UID, so it must not match the sandbox
    /// workload UID.
    pub proxy_uid: u32,
    pub grpc_endpoint: String,
    pub ssh_socket_path: String,
Expand Down Expand Up @@ -540,6 +546,16 @@ mod tests {
        assert_eq!(cfg.supervisor_topology, SupervisorTopology::Combined);
    }

    #[test]
    fn serde_override_supervisor_topology_proxy_pod() {
        let json = serde_json::json!({
            "supervisor_topology": "proxy-pod"
        });
        let cfg: KubernetesComputeConfig = serde_json::from_value(json).unwrap();
        assert_eq!(cfg.supervisor_topology, SupervisorTopology::ProxyPod);
        assert_eq!(cfg.supervisor_topology.to_string(), "proxy-pod");
    }

    #[test]
    fn serde_override_process_enforcement_full() {
        let json = serde_json::json!({
Expand Down