Skip to content

experimental/ssh: wait full startup timeout for server health check#5807

Open
TanishqDatabricks wants to merge 1 commit into
mainfrom
ssh-health-check-timeout
Open

experimental/ssh: wait full startup timeout for server health check#5807
TanishqDatabricks wants to merge 1 commit into
mainfrom
ssh-health-check-timeout

Conversation

@TanishqDatabricks

Copy link
Copy Markdown
Collaborator

Changes

databricks ssh connect health-checks the bootstrap SSH server after its job task reaches RUNNING by polling the driver-proxy /metadata endpoint. That poll used a hardcoded 30 × 2s = 60s window.

The problem: after the task reaches RUNNING, the driver proxy still answers /metadata with 503 until the container's HTTP endpoint is actually reachable. With a custom --base-environment, that environment install happens after the task is RUNNING, so warmup routinely outlasts 60s — and ssh connect fails with server is not ok, status code 503 even though the server comes up fine moments later. The 503 originates in the serverless driver-proxy layer, not our server binary (serveMetadata only ever returns 200/500).

This replaces the fixed loop with retries.Poll bounded by the same accelerator-aware budget already used to wait for the task to start (TaskStartupTimeout: 10 min CPU / 45 min GPU). The SDK poller backs off between attempts (linear, capped ~10s, jittered) instead of hammering the proxy every 2s during a long install, and it still succeeds immediately on the first attempt when the server is already warm.

formatServerStartError is extracted so the two failure modes stay distinct: a halted poll (bootstrap job terminated) already carries the job's failure trace and is returned as-is, while a timeout appends describeRunFailure diagnostics. This preserves the prior behavior and avoids printing the failure trace twice.

Why

Serverless SSH connections with a custom base environment (introduced in #5706) frequently fail on first connect with a driver-proxy 503, because installing the base environment pushes health-readiness past the fixed 60s window. Sizing the health-check wait to match the provisioning wait the user already opted into fixes this without a new flag.

Tests

  • New unit tests TestFormatServerStartErrorTimeoutAppendsRunFailure and TestFormatServerStartErrorHaltReturnedAsIs cover both error-routing branches (timeout appends run diagnostics; halt returns as-is with the trace exactly once).
  • go test ./experimental/ssh/... passes.
  • TestAccept/ssh acceptance suite passes: ok github.com/databricks/cli/acceptance 481.255s.

Local run timing

Measured locally (Apple Silicon). Note the acceptance figure includes compiling the full CLI binary on a cold build cache:

Stage Time
go mod download (deps, warm module cache) ~6.6s
go test ./experimental/ssh/... (unit, compile + run) ~7.9s
TestAccept/ssh (acceptance, incl. full CLI compile) 481.3s (~8m)

The bulk of the acceptance wall-clock is the one-time CLI/dependency compilation on a cold go-build cache; the ssh test execution itself is a small fraction of that.

This pull request and its description were written by Isaac.

The post-RUNNING health check against the driver-proxy /metadata endpoint
used a fixed 30 x 2s = 60s window. After the bootstrap task reaches RUNNING
the driver proxy still answers 503 until the container's HTTP endpoint is
reachable; with a custom --base-environment that install happens post-RUNNING,
so warmup routinely outlasts 60s and `ssh connect` fails with a driver-proxy
503 even though the server comes up fine.

Poll with retries.Poll for the same accelerator-aware budget already used for
the task to start (10m CPU / 45m GPU), backing off between attempts instead of
hammering the proxy every 2s during a long wait.

Extract formatServerStartError so a terminated bootstrap job returns its
failure trace once, while only a timeout appends run diagnostics.

Co-authored-by: Isaac
@TanishqDatabricks TanishqDatabricks force-pushed the ssh-health-check-timeout branch from cc2f502 to b99d10c Compare July 2, 2026 14:37
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Waiting for approval

Based on git history, these people are best suited to review:

  • @anton-107 -- recent work in experimental/ssh/internal/client/, ./

Eligible reviewers: @andrewnester, @denik, @pietern, @renaudhartert-db, @shreyas-goenka, @simonfaltum

Suggestions based on git history. See OWNERS for ownership rules.

@eng-dev-ecosystem-bot

Copy link
Copy Markdown
Collaborator

Integration test report

Commit: b99d10c

Run: 28598550145

Env 💚​RECOVERED 🙈​SKIP ✅​pass 🙈​skip Time
💚​ aws linux 10 13 230 1041 4:09
💚​ aws windows 10 13 232 1039 4:11
💚​ aws-ucws linux 10 13 314 959 9:16
💚​ aws-ucws windows 10 13 316 957 4:13
💚​ azure linux 4 15 230 1040 4:14
💚​ azure windows 4 15 232 1038 3:51
💚​ azure-ucws linux 4 15 316 956 5:03
💚​ azure-ucws windows 4 15 318 954 4:03
💚​ gcp linux 4 15 229 1042 9:46
💚​ gcp windows 4 15 231 1040 5:10
23 interesting tests: 13 SKIP, 10 RECOVERED
Test Name aws linux aws windows aws-ucws linux aws-ucws windows azure linux azure windows azure-ucws linux azure-ucws windows gcp linux gcp windows
💚​ TestAccept 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/invariant/no_drift 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/permissions 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions 💚​R 💚​R 💚​R 💚​R 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct 💚​R 💚​R 💚​R 💚​R
💚​ TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform 💚​R 💚​R 💚​R 💚​R
🙈​ TestAccept/bundle/resources/postgres_branches/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/recreate 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/replace_existing 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/update_protected 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_branches/without_branch_id 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_endpoints/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/postgres_projects/update_display_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/synced_database_tables/basic 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
🙈​ TestAccept/ssh/connection 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S 🙈​S
💚​ TestFetchRepositoryInfoAPI_FromRepo 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
💚​ TestFetchRepositoryInfoAPI_FromRepo/root 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
💚​ TestFetchRepositoryInfoAPI_FromRepo/subdir 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R 💚​R
Top 8 slowest tests (at least 2 minutes):
duration env testname
7:49 gcp linux TestSecretsPutSecretStringValue
7:34 aws-ucws linux TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform
4:24 gcp windows TestSecretsPutSecretStringValue
3:07 aws windows TestAccept
3:02 aws-ucws windows TestAccept
2:59 azure-ucws windows TestAccept
2:59 azure windows TestAccept
2:35 gcp windows TestAccept

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants