feat: Orchestrator - Terminate PENDING-wedged executions with a deadline watchdog by morgan-wowk · Pull Request #289 · TangleML/tangle

morgan-wowk · 2026-06-30T19:24:10Z

What this changes

Teaches the orchestrator to give up on a task that can never start. Today a pod stuck in Pending (e.g. a gcsfuse mount wedge: MountVolume.SetUp failed ... code = Unauthenticated → CreateContainerConfigError) is polled forever with no deadline and no terminal state — one such run sat stuck for 6 days. This adds a pending deadline that terminates the pod and fails the execution with the real kubelet reason.

Tracking: Shopify/oasis-backend#413

Before / after

Before

Pod wedged in Pending → orchestrator logs "remains in PENDING state" and returns, on every poll, indefinitely.
The execution never reaches a terminal state; started_at stays None; downstream tasks wait forever.
No signal to the user that anything is wrong — the run simply hangs.

After (once a deployment sets max_pending_duration)

Past the deadline, the orchestrator terminates the pod and marks the execution SYSTEM_ERROR, skipping downstream — the run fails fast instead of hanging.

The error message carries the actual cause, e.g.:

Task pending 0:27:13, never started (deadline 0:20:00).
CreateContainerConfigError: MountVolume.SetUp failed for volume "gcsfuse-prd-oasis-tmp":
  code = Unauthenticated desc = failed to prepare storage service
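
A minimal sketch of how a message of this shape could be assembled. `format_pending_timeout_message` and `_whole_seconds` are hypothetical names for illustration, not the PR's actual code. The sketch relies on `str()` of a `timedelta` rendering as `H:MM:SS`, and it assumes durations are truncated to whole seconds so no fractional part appears.

```python
from __future__ import annotations

from datetime import timedelta


def _whole_seconds(delta: timedelta) -> timedelta:
    """Drop the sub-second part so str() renders H:MM:SS without a fraction."""
    return timedelta(seconds=int(delta.total_seconds()))


def format_pending_timeout_message(
    pending_for: timedelta,
    deadline: timedelta,
    diagnostics: str | None = None,
) -> str:
    """Hypothetical helper: error text for a task that never left PENDING."""
    message = (
        f"Task pending {_whole_seconds(pending_for)}, "
        f"never started (deadline {_whole_seconds(deadline)})."
    )
    if diagnostics:
        # Append the kubelet reason when the launcher could provide one;
        # otherwise the generic timeout line stands alone.
        message += "\n" + diagnostics
    return message


print(
    format_pending_timeout_message(
        timedelta(minutes=27, seconds=13),
        timedelta(minutes=20),
    )
)
# prints: Task pending 0:27:13, never started (deadline 0:20:00).
```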

The changes

PENDING-deadline watchdog (`orchestrator_sql.py`)

New optional orchestrator param max_pending_duration: timedelta | None. Defaults to None (disabled) — this PR is a no-op in production until oasis-backend passes a value (separate submodule bump). Merging it carries no behavior change or risk.
New pure helper _pending_deadline_exceeded(created_at, now, max_pending_duration): returns False when disabled or when created_at is unknown, so legacy rows are never force-failed. Boundary is strict (>), so it fires only past the deadline.
In internal_process_one_running_execution, before the early return taken when the status is unchanged (PENDING == PENDING): when the deadline is exceeded, upload logs, call terminate(), and raise OrchestratorError(<message>).
Reuses existing teardown. The raise is caught by the existing outer handler, which already sets SYSTEM_ERROR, records the orchestration error on each node, and calls _mark_all_downstream_executions_as_skipped. No new state-machine code, no new failure path to maintain.
created_at is the correct anchor because started_at is None for a never-started container.

Real kubelet reason in the error message (`kubernetes_launchers.py`)

New pending_diagnostics property on LaunchedKubernetesContainer returns the main container's waiting reason + message, read from the already-cached _debug_pod — no extra Kubernetes API calls. This is the signal that carries the wedge cause (CreateContainerConfigError + the mount failure).
Fixes the root gap: the existing launcher_error_message reads only the terminated state, which never exists on a boot wedge, so the reason was being dropped.
No interface change. The property lives only on the K8s launcher; the orchestrator reads it via getattr(..., "pending_diagnostics", None), so the LaunchedContainer interface and its other implementations are untouched. When unavailable, the message falls back to the generic timeout line.
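
A rough sketch of both halves of this design, under stated assumptions. The pod is modeled with `SimpleNamespace` stand-ins that mirror the attribute layout of the Kubernetes Python client models (`status.container_statuses[*].state.waiting.reason` / `.message`). The container name `"main"`, the function `pending_diagnostics_from_pod`, and the two `Fake...` classes are hypothetical and exist only for illustration.

```python
from __future__ import annotations

from types import SimpleNamespace

MAIN_CONTAINER_NAME = "main"  # assumption: the real name is not given in the PR text


def pending_diagnostics_from_pod(pod) -> str | None:
    """Return the main container's waiting reason and message, or None."""
    statuses = getattr(pod.status, "container_statuses", None) or []
    for status in statuses:
        if status.name != MAIN_CONTAINER_NAME:
            continue
        waiting = status.state.waiting if status.state else None
        if waiting is None or not waiting.reason:
            return None
        if waiting.message:
            return f"{waiting.reason}: {waiting.message}"
        return waiting.reason
    # No status for the main container: nothing useful to report.
    return None


class FakeKubernetesContainer:
    """Stand-in for the K8s launcher: only this class exposes the property."""

    def __init__(self, debug_pod):
        self._debug_pod = debug_pod  # already-cached pod, so no extra API call

    @property
    def pending_diagnostics(self) -> str | None:
        return pending_diagnostics_from_pod(self._debug_pod)


class FakeOtherContainer:
    """Stand-in for any other launcher: no pending_diagnostics attribute."""


def read_diagnostics(launched_container) -> str | None:
    # getattr with a default keeps the shared interface untouched.
    return getattr(launched_container, "pending_diagnostics", None)


wedged_pod = SimpleNamespace(
    status=SimpleNamespace(
        container_statuses=[
            SimpleNamespace(
                name="main",
                state=SimpleNamespace(
                    waiting=SimpleNamespace(
                        reason="CreateContainerConfigError",
                        message="MountVolume.SetUp failed",
                    )
                ),
            )
        ]
    )
)

print(read_diagnostics(FakeKubernetesContainer(wedged_pod)))
# prints: CreateContainerConfigError: MountVolume.SetUp failed
print(read_diagnostics(FakeOtherContainer()))
# prints: None
```

The trade-off of the `getattr` approach is that the property is invisible to type checkers at the call site; in exchange, no other launcher implementation needs to change.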

Threshold guidance

max_pending_duration must exceed real Kueue admission / scheduling latency or it will false-positive on healthy queued work. A value around 20–30 min is a safe starting point. It is configurable precisely so it can be tuned in oasis-backend without a code change.
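
One way to sanity-check a candidate threshold before enabling it, as a hedged sketch: compare it against pending durations observed for healthy tasks. `threshold_false_positives` is a hypothetical helper and the sample durations are invented for illustration; real values would come from the deployment's own Kueue admission and scheduling metrics.

```python
from datetime import timedelta


def threshold_false_positives(
    candidate: timedelta,
    healthy_pending_durations: list[timedelta],
) -> list[timedelta]:
    """Return the healthy pending durations the candidate deadline would have killed."""
    # Same strict comparison as the watchdog: only durations past the deadline fire.
    return [d for d in healthy_pending_durations if d > candidate]


# Invented sample: how long healthy tasks waited before starting.
observed = [timedelta(minutes=2), timedelta(minutes=14), timedelta(minutes=22)]

print(threshold_false_positives(timedelta(minutes=20), observed))
# prints: [datetime.timedelta(seconds=1320)]
print(threshold_false_positives(timedelta(minutes=30), observed))
# prints: []
```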

Tests

tests/test_pending_deadline_watchdog.py:

_pending_deadline_exceeded: disabled (None), unknown created_at, under threshold, past threshold, strict boundary.
pending_diagnostics: gcsfuse mount-wedge message surfaced; no main-container status → None.

All green; black clean.

Out of scope (follow-ups)

oasis-backend wiring to set max_pending_duration and actually enable the watchdog (submodule bump).
Stop cache-reuse of stale PENDING nodes, so a resubmit doesn't re-link to the dead node.

morgan-wowk · 2026-06-30T19:49:56Z

+        is still pending and no terminated state exists. Returns None when there
+        is nothing useful to report.
+        """
+        return None


Needs review. Must be careful with interface changes.

Reverted — no interface change. pending_diagnostics now lives only on LaunchedKubernetesContainer, and the orchestrator reads it via getattr(..., "pending_diagnostics", None), so the base LaunchedContainer interface and its other implementations are untouched.

morgan-wowk · 2026-06-30T19:51:02Z

+        if len(lines) <= 1 and not pod_status.message:
+            # Only the bare header — nothing actionable to report.
+            return None
+        return "\n".join(lines)


This diff adds a lot of lines. Consider how much value and impact they really deliver, and whether a much shorter and simpler solution exists.

Trimmed from ~64 lines to 8. It now returns just the main container's waiting reason + message, which is the single signal that carries the actual wedge (CreateContainerConfigError + the MountVolume.SetUp failed ... Unauthenticated text from the incident). Dropped the init-container and pod-condition handling — the watchdog still fires and terminates in those cases; the message just falls back to the generic timeout line.

…ine watchdog

A pod that can never boot (e.g. a gcsfuse CSI-node mount wedge: MountVolume.SetUp failed ... code = Unauthenticated -> CreateContainerConfigError) stays in phase Pending forever. The orchestrator polls it indefinitely with no deadline, never terminating it or marking it SYSTEM_ERROR, so a run can sit stuck for days.

PENDING-deadline watchdog: add an optional max_pending_duration to the orchestrator. In internal_process_one_running_execution, when a container is still PENDING past the deadline, terminate it and raise OrchestratorError; the existing outer handler marks it SYSTEM_ERROR, records the error, and skips downstream. The deadline check is a pure helper (_pending_deadline_exceeded) and defaults to disabled (None), so behavior is unchanged until a deployment opts in. Rows without created_at are never force-failed.

Real kubelet reason in the error: add a pending_diagnostics property to the Kubernetes launcher that returns the main container's waiting reason and message (e.g. CreateContainerConfigError + the MountVolume.SetUp failure), so the SYSTEM_ERROR carries the real boot failure instead of a bare timeout. The orchestrator reads it via getattr, so no launcher interface change is needed.


morgan-wowk force-pushed the pending-deadline-watchdog branch from 1dd4523 to 488d892 Compare June 30, 2026 20:05

morgan-wowk changed the title ~~feat: Orchestrator - Terminate PENDING-wedged executions with a deadline watchdog (Options A+B)~~ feat: Orchestrator - Terminate PENDING-wedged executions with a deadline watchdog Jun 30, 2026

morgan-wowk wants to merge 1 commit into master from pending-deadline-watchdog
