update umbra and add umbra s3 by toschmidt · Pull Request #951 · ClickHouse/ClickBench

toschmidt · 2026-06-19T14:06:29Z

Updates Umbra and improves runtimes by loading the data from parquet and using compression.

This also resolves the memory usage problem on smaller instances reported in #543 and the infinite loop during the concurrent measurements in #891.

Adds a new Umbra variant that stores the data on S3 instead of the EBS device, similar to ClickHouse (web), which also uses S3 as a storage backend.

Update the Umbra ClickBench definition:

- Drop the primary key.
- Ingest from the Athena hits.parquet via the umbra.parquetview table function instead of a TSV COPY.
- Store the table with zstd compression. In the Docker variant, create.sql reads /data/hits.parquet from the bind mount.
- Switch the dataset download to hits.parquet (BENCH_DOWNLOAD_SCRIPT) to match.
- Require the loaded row count to equal exactly 99,997,497. Without this check, a partial load passes unnoticed and reports implausibly fast timings on the surviving subset.
- Run the Docker container with --privileged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
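The row-count requirement from this commit can be illustrated with a minimal Python sketch. It is not the PR's load script: the function name `check_row_count` is hypothetical, and the actual count is assumed to come from something like `SELECT count(*) FROM hits`, which is not shown here.

```python
# Expected size of the hits table, as stated in the commit message.
EXPECTED_ROWS = 99_997_497


def check_row_count(actual: int, expected: int = EXPECTED_ROWS) -> None:
    """Raise if the loaded table does not hold exactly `expected` rows.

    Hypothetical helper: a partial load would otherwise go unnoticed and
    yield misleadingly fast query timings on the rows that survived.
    """
    if actual != expected:
        raise RuntimeError(
            f"partial load: hits has {actual} rows, expected {expected}"
        )
```

Comparing for exact equality rather than a lower bound also catches accidental double loads, not only truncated ones.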

Add a new ClickBench system, "Umbra (S3)", that runs the Umbra benchmark with the hits table stored on Amazon S3 (backend=cloud) instead of local disk. It mirrors ../umbra; the only functional differences are where the table data lives and the bucket-provisioning step it requires.

How it works:

- create.sql registers an S3 bucket as Umbra remote storage and creates the table with backend=cloud, so table data lives in the bucket. The dataset (hits.parquet) is still ingested from a local copy via umbra.parquetview; only the resulting table is stored in S3.
- create-bucket provisions the bucket before the run: it resolves AWS credentials (env vars, then `aws configure`, then an interactive prompt), creates a globally-unique clickbench-umbra-s3-<date>-<uuid> bucket, and writes bucket/region/keys to a gitignored, chmod-600 .s3-env. The same static keys are handed to Umbra's `create remote storage` statement, so they must allow normal S3 data access.
- load sources .s3-env automatically and fails fast with a clear message if the UMBRA_S3_* vars are unset. After ingest it asserts the table has exactly 99,997,497 rows, since Umbra has been observed to leave a partial table on memory-constrained hosts and still produce implausibly fast warm timings on the surviving subset.
- Umbra addresses the bucket as s3://<bucket>:<region>/<path>; the region is part of the URI, not a separate option.
- delete-bucket empties and deletes the bucket recorded in .s3-env and removes the file; it is idempotent and touches no IAM resources.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
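The bucket naming, URI form, and .s3-env handling described in this commit can be sketched in Python as follows. This is an illustration under assumptions, not the PR's create-bucket or load scripts: all function names are hypothetical, the date format (YYYYMMDD) and the dash-free UUID are guesses chosen to stay within S3's 63-character bucket-name limit, and the concrete variable names under the UMBRA_S3_ prefix are invented for the example. No AWS calls are made.

```python
import os
import uuid
from datetime import date


def make_bucket_name(today=None, suffix=None):
    """Build a clickbench-umbra-s3-<date>-<uuid> style name (format assumed)."""
    today = today or date.today()
    suffix = suffix or uuid.uuid4().hex  # 32 hex chars, no dashes
    return f"clickbench-umbra-s3-{today:%Y%m%d}-{suffix}"


def umbra_s3_uri(bucket, region, path=""):
    """Format s3://<bucket>:<region>/<path>; the region is part of the URI."""
    return f"s3://{bucket}:{region}/{path.lstrip('/')}"


def write_env_file(path, values):
    """Write KEY=VALUE lines to a file readable only by its owner (mode 600)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        for key, value in values.items():
            handle.write(f"{key}={value}\n")
    os.chmod(path, 0o600)  # also tighten a file that already existed


def require_env(names, env=None):
    """Fail fast with a clear message if any required variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise SystemExit("missing environment variables: " + ", ".join(missing))
```

A caller might write `{"UMBRA_S3_BUCKET": name, "UMBRA_S3_REGION": region}` (hypothetical variable names) with `write_env_file(".s3-env", ...)` and later call `require_env` on the same names before starting the load.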

toschmidt and others added 2 commits June 22, 2026 13:16

toschmidt force-pushed the schmidt/umbra26.06 branch from f18517e to 0c1d4b1 Compare June 22, 2026 11:17

toschmidt changed the title ~~update umbra~~ update umbra and add umbra s3 Jun 22, 2026

toschmidt wants to merge 2 commits into ClickHouse:main from umbra-db:schmidt/umbra26.06
