feat: Orchestrator - Terminate PENDING-wedged executions with a deadline watchdog#289

Draft
morgan-wowk wants to merge 1 commit into master from pending-deadline-watchdog

morgan-wowk commented Jun 30, 2026

What this changes

Teaches the orchestrator to give up on a task that can never start. Today a pod stuck in Pending (e.g. a gcsfuse mount wedge: MountVolume.SetUp failed ... code = Unauthenticated -> CreateContainerConfigError) is polled forever with no deadline and no terminal state; one such run sat stuck for 6 days. This adds a pending deadline that terminates the pod and fails the execution with the real kubelet reason.

Tracking: Shopify/oasis-backend#413

Before / after

Before

  • Pod wedged in Pending → the orchestrator logs "remains in PENDING state" and returns, on every poll, indefinitely.
  • The execution never reaches a terminal state; started_at stays None; downstream tasks wait forever.
  • No signal to the user that anything is wrong — the run simply hangs.

After (once a deployment sets max_pending_duration)

  • Past the deadline, the orchestrator terminates the pod and marks the execution SYSTEM_ERROR, skipping downstream — the run fails fast instead of hanging.
  • The error message carries the actual cause, e.g.:
    Task pending 0:27:13, never started (deadline 0:20:00).
    CreateContainerConfigError: MountVolume.SetUp failed for volume "gcsfuse-prd-oasis-tmp":
      code = Unauthenticated desc = failed to prepare storage service
    

The changes

PENDING-deadline watchdog (orchestrator_sql.py)

  • New optional orchestrator param max_pending_duration: timedelta | None. Defaults to None (disabled) — this PR is a no-op in production until oasis-backend passes a value (separate submodule bump). Merging it carries no behavior change or risk.
  • New pure helper _pending_deadline_exceeded(created_at, now, max_pending_duration): returns False when disabled or when created_at is unknown, so legacy rows are never force-failed. Boundary is strict (>), so it fires only past the deadline.
  • In internal_process_one_running_execution, before the PENDING == PENDING early return: when the deadline is exceeded, upload logs, call terminate(), and raise OrchestratorError(<message>).
  • Reuses existing teardown. The raise is caught by the existing outer handler, which already sets SYSTEM_ERROR, records the orchestration error on each node, and calls _mark_all_downstream_executions_as_skipped. No new state-machine code, no new failure path to maintain.
  • created_at is the correct anchor because started_at is None for a never-started container.

Real kubelet reason in the error message (kubernetes_launchers.py)

  • New pending_diagnostics property on LaunchedKubernetesContainer returns the main container's waiting reason + message, read from the already-cached _debug_pod — no extra Kubernetes API calls. This is the signal that carries the wedge cause (CreateContainerConfigError + the mount failure).
  • Fixes the root gap: the existing launcher_error_message reads only the terminated state, which never exists on a boot wedge, so the reason was being dropped.
  • No interface change. The property lives only on the K8s launcher; the orchestrator reads it via getattr(..., "pending_diagnostics", None), so the LaunchedContainer interface and its other implementations are untouched. When unavailable, the message falls back to the generic timeout line.
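
A rough sketch of the duck-typed read and the fallback described above. The helper name `build_pending_timeout_message` and the two stand-in classes are hypothetical and exist only for illustration; the `getattr(..., "pending_diagnostics", None)` call and the message wording follow the PR description.

```python
from __future__ import annotations

from datetime import timedelta


def build_pending_timeout_message(
    launched_container: object,
    pending_for: timedelta,
    max_pending_duration: timedelta,
) -> str:
    """Hypothetical helper: compose the error text for a pending timeout."""
    message = (
        f"Task pending {pending_for}, never started "
        f"(deadline {max_pending_duration})."
    )
    # Duck-typed read: launchers that lack the property simply yield None,
    # so the base interface does not need a new member.
    diagnostics = getattr(launched_container, "pending_diagnostics", None)
    if diagnostics:
        message += "\n" + diagnostics
    return message


class FakeKubernetesContainer:
    """Stand-in for a launcher that exposes pending_diagnostics."""

    pending_diagnostics = "CreateContainerConfigError: MountVolume.SetUp failed"


class FakeOtherContainer:
    """Stand-in for a launcher without the property."""


waited = timedelta(minutes=27, seconds=13)
limit = timedelta(minutes=20)
print(build_pending_timeout_message(FakeKubernetesContainer(), waited, limit))
print(build_pending_timeout_message(FakeOtherContainer(), waited, limit))
```

The trade-off of `getattr` over an abstract property is that nothing enforces the contract statically; in exchange, other `LaunchedContainer` implementations need no changes.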

Threshold guidance

max_pending_duration must exceed real Kueue admission / scheduling latency or it will false-positive on healthy queued work. A value around 20–30 min is a safe starting point. It is configurable precisely so it can be tuned in oasis-backend without a code change.

Tests

tests/test_pending_deadline_watchdog.py:

  • _pending_deadline_exceeded: disabled (None), unknown created_at, under threshold, past threshold, strict boundary.
  • pending_diagnostics: gcsfuse mount-wedge message surfaced; no main-container status → None.

All green; black clean.

Out of scope (follow-ups)

  • oasis-backend wiring to set max_pending_duration and actually enable the watchdog (submodule bump).
  • Stop cache-reuse of stale PENDING nodes, so a resubmit doesn't re-link to the dead node.

is still pending and no terminated state exists. Returns None when there
is nothing useful to report.
"""
return None

Needs review. Must be careful with interface changes.

Reverted — no interface change. pending_diagnostics now lives only on LaunchedKubernetesContainer, and the orchestrator reads it via getattr(..., "pending_diagnostics", None), so the base LaunchedContainer interface and its other implementations are untouched.

if len(lines) <= 1 and not pod_status.message:
    # Only the bare header: nothing actionable to report.
    return None
return "\n".join(lines)

This is a large diff. Consider how much value and impact it really achieves, and whether there's a much shorter, simpler solution.

Trimmed from ~64 lines to 8. It now returns just the main container's waiting reason + message, which is the single signal that carries the actual wedge (CreateContainerConfigError + the MountVolume.SetUp failed ... Unauthenticated text from the incident). Dropped the init-container and pod-condition handling — the watchdog still fires and terminates in those cases; the message just falls back to the generic timeout line.
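
For illustration, the trimmed behavior might look roughly like the sketch below. It is not the PR's code. It is written as a free function over a duck-typed pod object rather than as the property on LaunchedKubernetesContainer, the container name `"main"` is an assumption, and the `SimpleNamespace` objects stand in for the Kubernetes client's pod status models (`status.container_statuses[*].state.waiting`).

```python
from __future__ import annotations

from types import SimpleNamespace


def pending_diagnostics(pod, main_container_name: str = "main") -> str | None:
    """Return "<reason>: <message>" for the main container's waiting state, else None."""
    statuses = (pod.status.container_statuses if pod and pod.status else None) or []
    for status in statuses:
        waiting = status.state.waiting if status.state else None
        if status.name == main_container_name and waiting and waiting.reason:
            if waiting.message:
                return f"{waiting.reason}: {waiting.message}"
            return waiting.reason
    # No main-container waiting state: the caller falls back to the generic line.
    return None


# Stand-in for a cached pod whose main container is wedged on a mount failure.
wedged_pod = SimpleNamespace(
    status=SimpleNamespace(
        container_statuses=[
            SimpleNamespace(
                name="main",
                state=SimpleNamespace(
                    waiting=SimpleNamespace(
                        reason="CreateContainerConfigError",
                        message="MountVolume.SetUp failed",
                    )
                ),
            )
        ]
    )
)
print(pending_diagnostics(wedged_pod))
```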

…ine watchdog

A pod that can never boot (e.g. a gcsfuse CSI-node mount wedge:
MountVolume.SetUp failed ... code = Unauthenticated -> CreateContainerConfigError)
stays in phase Pending forever. The orchestrator polls it indefinitely with no
deadline, never terminating it or marking it SYSTEM_ERROR, so a run can sit stuck
for days.

PENDING-deadline watchdog: add an optional max_pending_duration to the
orchestrator. In internal_process_one_running_execution, when a container is still
PENDING past the deadline, terminate it and raise OrchestratorError; the existing
outer handler marks it SYSTEM_ERROR, records the error, and skips downstream. The
deadline check is a pure helper (_pending_deadline_exceeded) and defaults to
disabled (None), so behavior is unchanged until a deployment opts in. Rows without
created_at are never force-failed.

Real kubelet reason in the error: add a pending_diagnostics property to the
Kubernetes launcher that returns the main container's waiting reason and message
(e.g. CreateContainerConfigError + the MountVolume.SetUp failure), so the
SYSTEM_ERROR carries the real boot failure instead of a bare timeout. The
orchestrator reads it via getattr, so no launcher interface change is needed.

morgan-wowk force-pushed the pending-deadline-watchdog branch from 1dd4523 to 488d892 on June 30, 2026 20:05
morgan-wowk changed the title from "feat: Orchestrator - Terminate PENDING-wedged executions with a deadline watchdog (Options A+B)" to "feat: Orchestrator - Terminate PENDING-wedged executions with a deadline watchdog" on Jun 30, 2026