feat: Orchestrator - Terminate PENDING-wedged executions with a deadline watchdog#289
morgan-wowk wants to merge 1 commit into
is still pending and no terminated state exists. Returns None when there
is nothing useful to report.
"""
return None
Needs review. Must be careful with interface changes.
Reverted — no interface change. pending_diagnostics now lives only on LaunchedKubernetesContainer, and the orchestrator reads it via getattr(..., "pending_diagnostics", None), so the base LaunchedContainer interface and its other implementations are untouched.
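A minimal sketch of the optional-attribute read described in this reply, assuming stand-in classes: BaseLaunched and KubernetesLaunched are hypothetical substitutes for LaunchedContainer and LaunchedKubernetesContainer, and the message wording is invented for illustration. It only shows how getattr with a None default lets the caller fall back to a generic timeout line without touching the base interface.

```python
from typing import Optional

# Hypothetical wording; the PR does not show the actual message text.
GENERIC_TIMEOUT_MESSAGE = "Container stayed PENDING past the configured deadline."


class BaseLaunched:
    """Stand-in for the base launcher interface; it has no pending_diagnostics."""


class KubernetesLaunched(BaseLaunched):
    """Stand-in for the Kubernetes implementation that exposes diagnostics."""

    def __init__(self, diagnostics: Optional[str]) -> None:
        self._diagnostics = diagnostics

    @property
    def pending_diagnostics(self) -> Optional[str]:
        return self._diagnostics


def describe_pending_timeout(container: object) -> str:
    """Build an error message, using diagnostics only when the launcher has them."""
    # getattr with a default avoids requiring the attribute on every launcher.
    diagnostics = getattr(container, "pending_diagnostics", None)
    if not diagnostics:
        return GENERIC_TIMEOUT_MESSAGE
    return f"{GENERIC_TIMEOUT_MESSAGE} Last known reason: {diagnostics}"
```

The trade-off is that the attribute name becomes an implicit contract: a typo on either side silently degrades to the generic message rather than failing loudly.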
if len(lines) <= 1 and not pod_status.message:
    # Only the bare header - nothing actionable to report.
    return None
return "\n".join(lines)
This is a lot of diff lines. Consider how much value / impact is really being achieved here and whether there's a much shorter and simpler solution.
Trimmed from ~64 lines to 8. It now returns just the main container's waiting reason + message, which is the single signal that carries the actual wedge (CreateContainerConfigError + the MountVolume.SetUp failed ... Unauthenticated text from the incident). Dropped the init-container and pod-condition handling — the watchdog still fires and terminates in those cases; the message just falls back to the generic timeout line.
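A rough sketch of what "just the main container's waiting reason + message" could look like, under stated assumptions. The dataclasses below are stand-ins that mirror the attribute layout of a Kubernetes container status (name, state.waiting.reason, state.waiting.message). The free function, the "main" container name, and the "reason: message" formatting are hypothetical; the actual change is a property on LaunchedKubernetesContainer that reads the cached pod.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Waiting:
    reason: Optional[str] = None
    message: Optional[str] = None


@dataclass
class ContainerState:
    waiting: Optional[Waiting] = None


@dataclass
class ContainerStatus:
    name: str
    state: Optional[ContainerState] = None


def pending_diagnostics(
    container_statuses: Optional[List[ContainerStatus]],
    main_container_name: str = "main",  # hypothetical container name
) -> Optional[str]:
    """Return the main container's waiting reason and message, or None."""
    for status in container_statuses or []:
        if status.name != main_container_name:
            continue
        waiting = status.state.waiting if status.state else None
        if waiting is None or not (waiting.reason or waiting.message):
            # Main container is not in a waiting state with anything to report.
            return None
        return ": ".join(part for part in (waiting.reason, waiting.message) if part)
    # No status for the main container at all.
    return None
```

Init-container and pod-condition states are deliberately ignored here, matching the trimmed scope described above.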
…ine watchdog
A pod that can never boot (e.g. a gcsfuse CSI-node mount wedge: MountVolume.SetUp failed ... code = Unauthenticated -> CreateContainerConfigError) stays in phase Pending forever. The orchestrator polls it indefinitely with no deadline, never terminating it or marking it SYSTEM_ERROR, so a run can sit stuck for days.
PENDING-deadline watchdog: add an optional max_pending_duration to the orchestrator. In internal_process_one_running_execution, when a container is still PENDING past the deadline, terminate it and raise OrchestratorError; the existing outer handler marks it SYSTEM_ERROR, records the error, and skips downstream. The deadline check is a pure helper (_pending_deadline_exceeded) and defaults to disabled (None), so behavior is unchanged until a deployment opts in. Rows without created_at are never force-failed.
Real kubelet reason in the error: add a pending_diagnostics property to the Kubernetes launcher that returns the main container's waiting reason and message (e.g. CreateContainerConfigError + the MountVolume.SetUp failure), so the SYSTEM_ERROR carries the real boot failure instead of a bare timeout. The orchestrator reads it via getattr, so no launcher interface change is needed.
1dd4523 to 488d892
What this changes
Teaches the orchestrator to give up on a task that can never start. Today a pod stuck in Pending (e.g. a gcsfuse mount wedge: MountVolume.SetUp failed ... code = Unauthenticated → CreateContainerConfigError) is polled forever with no deadline and no terminal state; one such run sat stuck for 6 days. This adds a pending deadline that terminates the pod and fails the execution with the real kubelet reason.
Tracking: Shopify/oasis-backend#413
Before / after
Before
- A pod stuck in Pending → orchestrator logs "remains in PENDING state" and returns, every poll, indefinitely. started_at stays None; downstream tasks wait forever.
After (once a deployment sets max_pending_duration)
- A pod still Pending past the deadline is terminated and the execution is marked SYSTEM_ERROR, skipping downstream; the run fails fast instead of hanging.
The changes
PENDING-deadline watchdog (orchestrator_sql.py)
- Adds an optional max_pending_duration: timedelta | None. Defaults to None (disabled), so this PR is a no-op in production until oasis-backend passes a value (separate submodule bump). Merging it carries no behavior change or risk.
- _pending_deadline_exceeded(created_at, now, max_pending_duration): returns False when disabled or when created_at is unknown, so legacy rows are never force-failed. Boundary is strict (>), so it fires only past the deadline.
- In internal_process_one_running_execution, before the PENDING == PENDING early return: when the deadline is exceeded, upload logs, call terminate(), and raise OrchestratorError(<message>).
- The existing outer handler marks the execution SYSTEM_ERROR, records the orchestration error on each node, and calls _mark_all_downstream_executions_as_skipped. No new state-machine code, no new failure path to maintain.
- created_at is the correct anchor because started_at is None for a never-started container.
Real kubelet reason in the error message (kubernetes_launchers.py)
- The new pending_diagnostics property on LaunchedKubernetesContainer returns the main container's waiting reason + message, read from the already-cached _debug_pod, with no extra Kubernetes API calls. This is the signal that carries the wedge cause (CreateContainerConfigError + the mount failure).
- launcher_error_message reads only the terminated state, which never exists on a boot wedge, so the reason was being dropped.
- The orchestrator reads it via getattr(..., "pending_diagnostics", None), so the LaunchedContainer interface and its other implementations are untouched. When unavailable, the message falls back to the generic timeout line.
Threshold guidance
max_pending_duration must exceed real Kueue admission / scheduling latency, or it will produce false positives on healthy queued work. A value around 20–30 min is a safe starting point. It is configurable precisely so it can be tuned in oasis-backend without a code change.
Tests
tests/test_pending_deadline_watchdog.py:
- _pending_deadline_exceeded: disabled (None), unknown created_at, under threshold, past threshold, strict boundary.
- pending_diagnostics: gcsfuse mount-wedge message surfaced; no main-container status → None.
All green; black clean.
Out of scope (follow-ups)
- Have oasis-backend pass max_pending_duration and actually enable the watchdog (submodule bump).
- … PENDING nodes, so a resubmit doesn't re-link to the dead node.