experimental/ssh: wait full startup timeout for server health check by TanishqDatabricks · Pull Request #5807 · databricks/cli

TanishqDatabricks · 2026-07-02T14:36:56Z

Changes

databricks ssh connect health-checks the bootstrap SSH server after its job task reaches RUNNING by polling the driver-proxy /metadata endpoint. That poll used a hardcoded 30 × 2s = 60s window.

The problem: after the task reaches RUNNING, the driver proxy still answers /metadata with 503 until the container's HTTP endpoint is actually reachable. With a custom --base-environment, that environment install happens after the task is RUNNING, so warmup routinely outlasts 60s — and ssh connect fails with server is not ok, status code 503 even though the server comes up fine moments later. The 503 originates in the serverless driver-proxy layer, not our server binary (serveMetadata only ever returns 200/500).

This replaces the fixed loop with retries.Poll bounded by the same accelerator-aware budget already used to wait for the task to start (TaskStartupTimeout: 10 min CPU / 45 min GPU). The SDK poller backs off between attempts (linear, capped ~10s, jittered) instead of hammering the proxy every 2s during a long install, and it still succeeds immediately on the first attempt when the server is already warm.

formatServerStartError is extracted so the two failure modes stay distinct: a halted poll (bootstrap job terminated) already carries the job's failure trace and is returned as-is, while a timeout appends describeRunFailure diagnostics. This preserves the prior behavior and avoids printing the failure trace twice.

Why

Serverless SSH connections with a custom base environment (introduced in #5706) frequently fail on first connect with a driver-proxy 503, because installing the base environment pushes health-readiness past the fixed 60s window. Sizing the health-check wait to match the provisioning wait the user already opted into fixes this without a new flag.

Tests

New unit tests TestFormatServerStartErrorTimeoutAppendsRunFailure and TestFormatServerStartErrorHaltReturnedAsIs cover both error-routing branches (timeout appends run diagnostics; halt returns as-is with the trace exactly once).
go test ./experimental/ssh/... passes.
TestAccept/ssh acceptance suite passes: ok github.com/databricks/cli/acceptance 481.255s.

Local run timing

Measured locally (Apple Silicon). Note the acceptance figure includes compiling the full CLI binary on a cold build cache:

Stage	Time
`go mod download` (deps, warm module cache)	~6.6s
`go test ./experimental/ssh/...` (unit, compile + run)	~7.9s
`TestAccept/ssh` (acceptance, incl. full CLI compile)	481.3s (~8m)

The bulk of the acceptance wall-clock is the one-time CLI/dependency compilation on a cold go-build cache; the ssh test execution itself is a small fraction of that.

This pull request and its description were written by Isaac.

The post-RUNNING health check against the driver-proxy /metadata endpoint used a fixed 30 x 2s = 60s window. After the bootstrap task reaches RUNNING the driver proxy still answers 503 until the container's HTTP endpoint is reachable; with a custom --base-environment that install happens post-RUNNING, so warmup routinely outlasts 60s and `ssh connect` fails with a driver-proxy 503 even though the server comes up fine. Poll with retries.Poll for the same accelerator-aware budget already used for the task to start (10m CPU / 45m GPU), backing off between attempts instead of hammering the proxy every 2s during a long wait. Extract formatServerStartError so a terminated bootstrap job returns its failure trace once, while only a timeout appends run diagnostics. Co-authored-by: Isaac

github-actions · 2026-07-02T14:37:28Z

Waiting for approval

Based on git history, these people are best suited to review:

@anton-107 -- recent work in experimental/ssh/internal/client/, ./

Eligible reviewers: @andrewnester, @denik, @pietern, @renaudhartert-db, @shreyas-goenka, @simonfaltum

_{Suggestions based on git history. See OWNERS for ownership rules.}

eng-dev-ecosystem-bot · 2026-07-02T15:19:26Z

Integration test report

Commit: b99d10c

Run: 28598550145

	Env	💚RECOVERED	🙈SKIP	✅pass	🙈skip	Time
💚	aws linux	10	13	230	1041	4:09
💚	aws windows	10	13	232	1039	4:11
💚	aws-ucws linux	10	13	314	959	9:16
💚	aws-ucws windows	10	13	316	957	4:13
💚	azure linux	4	15	230	1040	4:14
💚	azure windows	4	15	232	1038	3:51
💚	azure-ucws linux	4	15	316	956	5:03
💚	azure-ucws windows	4	15	318	954	4:03
💚	gcp linux	4	15	229	1042	9:46
💚	gcp windows	4	15	231	1040	5:10

23 interesting tests: 13 SKIP, 10 RECOVERED

	Test Name	aws linux	aws windows	aws-ucws linux	aws-ucws windows	azure linux	azure windows	azure-ucws linux	azure-ucws windows	gcp linux	gcp windows
💚	TestAccept	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R
🙈	TestAccept/bundle/invariant/no_drift	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/permissions	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions	💚R	💚R	💚R	💚R	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct	💚R	💚R	💚R	💚R
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform	💚R	💚R	💚R	💚R
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions	💚R	💚R	💚R	💚R	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct	💚R	💚R	💚R	💚R
💚	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform	💚R	💚R	💚R	💚R
🙈	TestAccept/bundle/resources/postgres_branches/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/recreate	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/replace_existing	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/update_protected	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_branches/without_branch_id	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_endpoints/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/postgres_projects/update_display_name	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/synced_database_tables/basic	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
🙈	TestAccept/ssh/connection	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S	🙈S
💚	TestFetchRepositoryInfoAPI_FromRepo	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R
💚	TestFetchRepositoryInfoAPI_FromRepo/root	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R
💚	TestFetchRepositoryInfoAPI_FromRepo/subdir	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R	💚R

Top 8 slowest tests (at least 2 minutes):

duration	env	testname
7:49	gcp linux	TestSecretsPutSecretStringValue
7:34	aws-ucws linux	TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform
4:24	gcp windows	TestSecretsPutSecretStringValue
3:07	aws windows	TestAccept
3:02	aws-ucws windows	TestAccept
2:59	azure-ucws windows	TestAccept
2:59	azure windows	TestAccept
2:35	gcp windows	TestAccept

TanishqDatabricks force-pushed the ssh-health-check-timeout branch from cc2f502 to b99d10c Compare July 2, 2026 14:37

TanishqDatabricks temporarily deployed to test-trigger-is July 2, 2026 14:37 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

experimental/ssh: wait full startup timeout for server health check#5807

experimental/ssh: wait full startup timeout for server health check#5807
TanishqDatabricks wants to merge 1 commit into
mainfrom
ssh-health-check-timeout

TanishqDatabricks commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

eng-dev-ecosystem-bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

TanishqDatabricks commented Jul 2, 2026

Changes

Why

Tests

Local run timing

Uh oh!

github-actions Bot commented Jul 2, 2026

Waiting for approval

Uh oh!

eng-dev-ecosystem-bot commented Jul 2, 2026

Integration test report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants