experimental/ssh: wait full startup timeout for server health check#5807
Open
TanishqDatabricks wants to merge 1 commit into
Open
experimental/ssh: wait full startup timeout for server health check#5807TanishqDatabricks wants to merge 1 commit into
TanishqDatabricks wants to merge 1 commit into
Conversation
The post-RUNNING health check against the driver-proxy /metadata endpoint used a fixed 30 x 2s = 60s window. After the bootstrap task reaches RUNNING the driver proxy still answers 503 until the container's HTTP endpoint is reachable; with a custom --base-environment that install happens post-RUNNING, so warmup routinely outlasts 60s and `ssh connect` fails with a driver-proxy 503 even though the server comes up fine. Poll with retries.Poll for the same accelerator-aware budget already used for the task to start (10m CPU / 45m GPU), backing off between attempts instead of hammering the proxy every 2s during a long wait. Extract formatServerStartError so a terminated bootstrap job returns its failure trace once, while only a timeout appends run diagnostics. Co-authored-by: Isaac
cc2f502 to
b99d10c
Compare
Contributor
Waiting for approvalBased on git history, these people are best suited to review:
Eligible reviewers: Suggestions based on git history. See OWNERS for ownership rules. |
Collaborator
Integration test reportCommit: b99d10c
23 interesting tests: 13 SKIP, 10 RECOVERED
Top 8 slowest tests (at least 2 minutes):
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
databricks ssh connecthealth-checks the bootstrap SSH server after its job task reachesRUNNINGby polling the driver-proxy/metadataendpoint. That poll used a hardcoded30 × 2s = 60swindow.The problem: after the task reaches
RUNNING, the driver proxy still answers/metadatawith 503 until the container's HTTP endpoint is actually reachable. With a custom--base-environment, that environment install happens after the task isRUNNING, so warmup routinely outlasts 60s — andssh connectfails withserver is not ok, status code 503even though the server comes up fine moments later. The 503 originates in the serverless driver-proxy layer, not our server binary (serveMetadataonly ever returns 200/500).This replaces the fixed loop with
retries.Pollbounded by the same accelerator-aware budget already used to wait for the task to start (TaskStartupTimeout: 10 min CPU / 45 min GPU). The SDK poller backs off between attempts (linear, capped ~10s, jittered) instead of hammering the proxy every 2s during a long install, and it still succeeds immediately on the first attempt when the server is already warm.formatServerStartErroris extracted so the two failure modes stay distinct: a halted poll (bootstrap job terminated) already carries the job's failure trace and is returned as-is, while a timeout appendsdescribeRunFailurediagnostics. This preserves the prior behavior and avoids printing the failure trace twice.Why
Serverless SSH connections with a custom base environment (introduced in #5706) frequently fail on first connect with a driver-proxy 503, because installing the base environment pushes health-readiness past the fixed 60s window. Sizing the health-check wait to match the provisioning wait the user already opted into fixes this without a new flag.
Tests
TestFormatServerStartErrorTimeoutAppendsRunFailureandTestFormatServerStartErrorHaltReturnedAsIscover both error-routing branches (timeout appends run diagnostics; halt returns as-is with the trace exactly once).go test ./experimental/ssh/...passes.TestAccept/sshacceptance suite passes:ok github.com/databricks/cli/acceptance 481.255s.Local run timing
Measured locally (Apple Silicon). Note the acceptance figure includes compiling the full CLI binary on a cold build cache:
go mod download(deps, warm module cache)go test ./experimental/ssh/...(unit, compile + run)TestAccept/ssh(acceptance, incl. full CLI compile)The bulk of the acceptance wall-clock is the one-time CLI/dependency compilation on a cold
go-buildcache; the ssh test execution itself is a small fraction of that.This pull request and its description were written by Isaac.