feat(webapp,run-engine): queue metrics and health dashboard by ericallam · Pull Request #4131 · triggerdotdev/trigger.dev

ericallam · 2026-07-03T08:23:47Z

Summary

Adds per-queue observability to the Queues page: depth (backlog), throughput (enqueued, started, completed), concurrency, whether a queue is throttled, and the scheduling delay (how long runs wait between becoming eligible and actually starting). Each queue shows health at a glance in the list, plus a per-queue detail page with charts, so you can answer "does this queue have enough concurrency to keep up?".

Both the data collection and the dashboard are off by default and gated independently: metric emission is a global switch, and the dashboard is turned on per organization. With both off, the Queues page is unchanged.

Design

Queue operations emit two kinds of signal. Gauges (depth, running, limit, throttled) are read inside the same Redis script that performs the enqueue or dequeue, so the reading is atomic, and returned on the script's reply for the app to forward. Counters (enqueued, started, completed) are cumulative odometers, so a dropped reading self-heals: the next one restates the running total. Both land on one Redis stream on a dedicated metrics instance (falling back to the run queue's Redis when self-hosting), drain through a consumer into ClickHouse (raw, a 10-second-bucket materialized view, and a 30-day aggregate), and the dashboards read the aggregate. The run queue's own Redis carries no metrics stream.
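The self-healing property of cumulative counters can be shown with a minimal sketch. The names `OdometerReading` and `throughputBetween` are hypothetical (they do not come from this PR), and the sketch assumes the running total never decreases within the window.

```typescript
// Hypothetical shape: one cumulative ("odometer") reading of a queue counter.
interface OdometerReading {
  timestampMs: number;
  total: number; // running total since the counter was created
}

// Throughput over a window is the last total minus the first total,
// so a reading lost in between does not change the answer.
function throughputBetween(readings: OdometerReading[]): number {
  if (readings.length < 2) return 0;
  const sorted = [...readings].sort((a, b) => a.timestampMs - b.timestampMs);
  return sorted[sorted.length - 1].total - sorted[0].total;
}

const complete: OdometerReading[] = [
  { timestampMs: 0, total: 100 },
  { timestampMs: 10000, total: 104 },
  { timestampMs: 20000, total: 110 },
];

// The same stream with the middle reading dropped in transit.
const withDrop: OdometerReading[] = [complete[0], complete[2]];

console.log(throughputBetween(complete)); // 10
console.log(throughputBetween(withDrop)); // 10
```

A per-event delta counter would undercount by 4 in the second case; the cumulative form recovers because the next reading restates the total.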

The one change that is live the moment this deploys, independent of both flags, is the enqueue/dequeue script reply shape: those scripts now return a 2-tuple so the gauge reading can ride back to the app. That path is exercised on every queue op, so it is the part of run-engine worth the closest review.
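As an illustration of why the reply shape deserves review, a caller could validate the 2-tuple before using it. This is only a sketch: `splitScriptReply` is a hypothetical helper, and the element types are left opaque because the description does not spell them out.

```typescript
// Hypothetical helper: checks that a script reply is a 2-tuple of
// [operation result, gauge reading] and splits it. Both elements are kept
// as `unknown` because their concrete types are not given in the description.
function splitScriptReply(reply: unknown): { result: unknown; gauge: unknown } {
  if (!Array.isArray(reply) || reply.length !== 2) {
    throw new Error("expected a 2-tuple script reply");
  }
  return { result: reply[0], gauge: reply[1] };
}

const ok = splitScriptReply(["message-id", ["depth", "3"]]);
console.log(ok.result); // "message-id"

let rejected = false;
try {
  splitScriptReply("message-id"); // a bare (pre-change style) reply
} catch {
  rejected = true;
}
console.log(rejected); // true
```

Failing loudly on an unexpected shape is one possible design choice; a caller could equally fall back to treating the reply as a bare result.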

changeset-bot · 2026-07-03T08:23:52Z

⚠️ No Changeset found

Latest commit: 5511df5

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


coderabbitai · 2026-07-03T08:25:40Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

This PR adds queue-metrics ingestion, storage, query, and UI support. It introduces a Redis/ClickHouse metrics pipeline package, ClickHouse queue-metrics tables and query helpers, run-queue emission hooks, gap-filling support in TSQL, and new webapp admin, dashboard, list, and detail routes. It also adds environment and feature-flag gating, seed tooling, and tests across the pipeline and query layers.

Related PRs: None found.

Suggested labels: enhancement, area: webapp, area: run-engine, area: internal-packages

Suggested reviewers: ericallam, matt-aitken

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is missing required template sections, including Closes `#issue`, checklist, Testing, Changelog, and Screenshots. | Add the template sections and fill them out, starting with Closes #, the checklist, testing steps, changelog, and screenshots. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title is concise and accurately summarizes the main change: queue metrics and health dashboard for webapp/run-engine. |


pkg-pr-new · 2026-07-04T08:31:49Z

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@cbe5444

trigger.dev

npm i https://pkg.pr.new/trigger.dev@cbe5444

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@cbe5444

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@cbe5444

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@cbe5444

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@cbe5444

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@cbe5444

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@cbe5444

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@cbe5444

commit: cbe5444

…peline

…signals Gauges are read inside the enqueue/dequeue Lua and returned on the script reply as a 2-tuple; counters are cumulative odometers. The run-queue Redis carries no metrics stream of its own.

…witch

…counters entryOrderKey returns a string built with BigInt math so ordering stays correct at real epoch magnitudes. Odometer keys are namespaced by definition name. The consumer reports null lag for a missing consumer group instead of 0, and empty gauge values parse as NaN rather than 0.
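The empty-gauge fix matters because JavaScript converts an empty string to zero. A minimal sketch of the distinction, using a hypothetical `parseGauge` helper that is not taken from the PR:

```typescript
// Number("") evaluates to 0 in JavaScript, so a missing gauge value would be
// indistinguishable from a real reading of zero unless it is handled first.
function parseGauge(raw: string | null | undefined): number {
  if (raw === null || raw === undefined || raw.trim() === "") {
    return Number.NaN; // "no reading", not "depth is zero"
  }
  return Number(raw);
}

console.log(Number(""));                      // 0
console.log(Number.isNaN(parseGauge("")));    // true
console.log(parseGauge("42"));                // 42
```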

…ng order keys The wait-time quantile materialized view now excludes wait_ms = 0 rows so it matches the count aggregation. order_key accepts a string or a number. Migration comments no longer contain semicolons that split the migration into invalid statements.

…rride The queues list tolerates a metrics query failure by rendering without metrics and logging a warning. UsageSparkline renders its total override even when every bucket is zero. The queue detail page returns 404 and its loader skips the metrics query when the feature flag is off. The seed script validates bucket size and only writes ClickHouse against a local host.

A bucket-led ORDER BY DESC combined with fillGaps emitted an ascending WITH FILL (positive step, ascending bounds), which produces invalid or empty fills. Skip the gap-fill rewrite for descending orders and let the plain descending query stand. Adds a DESC fillGaps test.

Packs the stream sequence with a 1e6 factor (was 1e5) so up to 1M entries per millisecond per shard fit before a seq could spill into the next millisecond's range, far above what a single Redis stream can produce. ms*1e6 stays within UInt64. Also fixes the webapp mapping test that still expected a numeric order_key after the switch to a BigInt-derived string.
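The packing arithmetic can be sketched as follows. `orderKeyFromStreamId` is a hypothetical stand-in rather than the PR's `entryOrderKey`, and it assumes the input is a Redis stream ID of the form `<ms>-<seq>`. BigInt is needed because ms × 1e6 at current epoch values (about 1.7e18) is well past `Number.MAX_SAFE_INTEGER` (about 9.0e15), while still below the UInt64 maximum (about 1.8e19).

```typescript
const SEQ_FACTOR = BigInt(1000000); // 1e6 slots per millisecond, per the commit message

// Hypothetical helper: packs "<ms>-<seq>" into a single sortable decimal string.
function orderKeyFromStreamId(streamId: string): string {
  const match = /^(\d+)-(\d+)$/.exec(streamId);
  if (!match) throw new Error(`not a stream ID: ${streamId}`);
  const ms = BigInt(match[1]);
  const seq = BigInt(match[2]);
  if (seq >= SEQ_FACTOR) {
    throw new RangeError("sequence would spill into the next millisecond's range");
  }
  return (ms * SEQ_FACTOR + seq).toString();
}

console.log(orderKeyFromStreamId("1751700000000-5")); // "1751700000000000005"

// The last slot of one millisecond still sorts before the first slot of the next.
const lastOfMs = BigInt(orderKeyFromStreamId("1751700000000-999999"));
const firstOfNextMs = BigInt(orderKeyFromStreamId("1751700000001-0"));
console.log(lastOfMs < firstOfNextMs); // true
```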

The queues list and queue detail pages now use the shared TimeFilter (any preset period or a custom date range) and everything on the page follows it: header tiles, per-queue metric columns, charts, and stats. The custom period buttons, hand-rolled chart cards, and duplicated metric fetch loops are replaced by the ChartCard and Chart primitives, UsageSparkline, and a shared useMetricResourceQuery hook. The ClickHouse list queries take an explicit end bound so fixed ranges query only their window.

Queries using deltaSumTimestampMerge failed with an unknown function error, which broke the queue detail stats and the started counts on the built-in Queues dashboard.

The queues list header tiles now render the same line chart, grid, and tooltip as the rest of the metrics charts instead of a row sparkline, with the headline value in the tile header. The env saturation tile draws the environment concurrency limit and burst limit as labeled reference lines. Chart tooltips keep a gap between the series label and the value, and the shared line chart gains showDots and referenceLines options.

Adds an Allocation tab to the Queues page (behind the queue metrics UI flag): overview cards, a burst-aware capacity bar showing each queue allocation and its live usage in a distinct color, an inline-editable limits table with per-queue locks, load-weighted auto-balance, and a review dialog that bulk-applies limits as overrides through the existing concurrency system. The queue list now defaults to Busiest ordering (with Backlog and Name options). ClickHouse ranks queues by activity over the last 15 minutes and returns just the requested page of names, so the cost per page is one small aggregate regardless of environment size; idle queues follow in name order and any failure falls back to name ordering. The classic page keeps plain name order.

The fallback WHERE injection only targeted the top-level SELECT, so a query shaped as an outer aggregation over a FROM subquery failed to compile: the time column only exists inside the subquery. Descend into the subquery so the fallback lands next to the table reference.

Adds two rollups fed from the raw landing table: a per-queue 5-minute tier and an environment-level 1-minute tier (gauges plus TDigest wait quantiles). Ranking now reads the 5m tier and returns the page and the ranked total in one windowed query instead of two scans. The 5m materialized view reads raw rather than cascading off the 10s table: deltaSumTimestamp states hold a single first/last segment, so merging states in an MV's hash-ordered GROUP BY double-counts bridging spans. For the same reason the env tier carries no counter columns, and env-wide counter totals must group by queue before summing.

The built-in queues dashboard's enqueued vs started chart merged counter states across queues, which mixes unrelated cumulative counters and returns wrong totals; it now merges per queue and sums outside. Env header tiles and saturation charts read the environment rollup, so their cost no longer scales with queue count, and coarse-bucket ranges are served from the 5m rollup automatically. Queue list ranking runs as one query, time bounds are aligned to the bucket grid, and repeated auto-refresh reads share ClickHouse query-cache entries.
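Why counters must be merged per queue and summed outside can be seen in a simplified model. The sketch below uses a plain last-minus-first delta as a stand-in for the idea; it is not ClickHouse's `deltaSumTimestamp`, and all names are hypothetical.

```typescript
interface CounterReading {
  queue: string;
  timestampMs: number;
  total: number; // cumulative count for this queue only
}

// Last-minus-first delta over one series of readings.
function seriesDelta(readings: CounterReading[]): number {
  if (readings.length < 2) return 0;
  const sorted = [...readings].sort((a, b) => a.timestampMs - b.timestampMs);
  return sorted[sorted.length - 1].total - sorted[0].total;
}

// Correct: compute the delta per queue, then sum the per-queue results.
function sumOfPerQueueDeltas(readings: CounterReading[]): number {
  const byQueue = new Map<string, CounterReading[]>();
  for (const reading of readings) {
    const list = byQueue.get(reading.queue) ?? [];
    list.push(reading);
    byQueue.set(reading.queue, list);
  }
  let total = 0;
  byQueue.forEach((list) => {
    total += seriesDelta(list);
  });
  return total;
}

const readings: CounterReading[] = [
  { queue: "a", timestampMs: 0, total: 1000 },
  { queue: "b", timestampMs: 5, total: 3 },
  { queue: "a", timestampMs: 10, total: 1010 },
  { queue: "b", timestampMs: 15, total: 8 },
];

console.log(seriesDelta(readings));         // -992: unrelated counters mixed together
console.log(sumOfPerQueueDeltas(readings)); // 15: 10 from "a" plus 5 from "b"
```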

… rollup The env rollup's win comes from dropping the queue dimension, not from coarser buckets: row count is queue-independent (~8640/day/env), so full 10-second granularity stays cheap at any range. Env header tiles and saturation charts now resolve short-range detail exactly like the per-queue charts, and the current-value tiles read the latest 10-second bucket instead of a minute-wide one.
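The ~8640 figure follows from one row per 10-second bucket per day. A small sketch of the arithmetic, assuming one fully merged row per bucket per series; the 500-queue figure is purely illustrative:

```typescript
const SECONDS_PER_DAY = 86400;

// Rows per day for a rollup with the given bucket width and number of series.
function rowsPerDay(bucketSeconds: number, seriesCount: number = 1): number {
  return (SECONDS_PER_DAY / bucketSeconds) * seriesCount;
}

console.log(rowsPerDay(10));      // 8640: env rollup, independent of queue count
console.log(rowsPerDay(10, 500)); // 4320000: per-queue table with a hypothetical 500 queues
```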

The simulator's --reset only cleared the raw and 10s tables, leaving stale rows in the 5m and env rollups. It also force-merges the rollups after seeding so current-value widgets read cleanly.

Counter events now emit per queue and op odometer readings with a seeded zero baseline, matching the production emitter, so throughput and started counts reconstruct from simulated data instead of reading zero. Scenario switches prune the previous scenario's queues, a --project flag seeds each scenario into its own project for side-by-side design review, and a new many-queues scenario covers pagination and relevance ranking with one runaway queue, a busy head, a bursty middle, and a sparse tail. Adds --help.

A --usage flag stages plausible running counts in the local run-queue Redis for the seeded queues, so the list's Running column and the Allocation tab's usage bars have data without the run engine. Staged state is reconciled on every run: present with --usage, cleared without. Local Redis hosts only.

The tail query's exclusion list overwrote the search's name filter via object spread, so searching while sorted by activity showed unrelated queues past the ranked head. Combine the conditions with AND instead.
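The overwrite is easy to reproduce with plain objects. The filter shapes below are hypothetical, loosely modeled on an ORM-style `where` object, and the queue names are made up:

```typescript
// Hypothetical filter types for illustration only.
type NameFilter = { contains?: string; notIn?: string[] };
type Where = { name?: NameFilter; AND?: Where[] };

const searchWhere: Where = { name: { contains: "email" } };
const excludeRanked: Where = { name: { notIn: ["email-high", "email-low"] } };

// Bug: both objects use the `name` key, so the later spread replaces the earlier one.
const overwritten: Where = { ...searchWhere, ...excludeRanked };

// Fix: keep both conditions and require them together.
const combined: Where = { AND: [searchWhere, excludeRanked] };

console.log(overwritten.name?.contains); // undefined: the search term was lost
console.log(combined.AND?.length);       // 2
```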

…ot ready Without a readiness guard, every fire-and-forget emit during a metrics Redis outage queued a command in ioredis's in-memory offline queue until rejection. Metrics are loss-tolerant by design, so drop instead; waitUntilReady() lets embedders await the initial connect.
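A drop-when-not-ready guard can be sketched against a minimal client interface. `MetricsClient` and `createEmitter` are hypothetical names; a real Redis client exposes a richer surface, and this sketch does not reproduce the PR's emitter.

```typescript
// Hypothetical minimal client surface for illustration.
interface MetricsClient {
  isReady(): boolean;
  send(entry: string): Promise<void>;
}

function createEmitter(client: MetricsClient) {
  let dropped = 0;
  return {
    // Fire-and-forget: never awaits and never queues while disconnected.
    emit(entry: string): void {
      if (!client.isReady()) {
        dropped += 1; // metrics are loss-tolerant, so drop instead of buffering
        return;
      }
      client.send(entry).catch(() => {
        dropped += 1;
      });
    },
    droppedCount(): number {
      return dropped;
    },
  };
}

// In-memory stand-in for a metrics connection.
let ready = false;
const sent: string[] = [];
const fakeClient: MetricsClient = {
  isReady: () => ready,
  send: (entry) => {
    sent.push(entry);
    return Promise.resolve();
  },
};

const emitter = createEmitter(fakeClient);
emitter.emit("depth=3"); // dropped: client not ready
ready = true;
emitter.emit("depth=4"); // forwarded
console.log(sent, emitter.droppedCount());
```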

The allocation view keeps manual limit edits, the review dialog, and bulk apply. The one-shot auto-balance button is removed (and the row locks whose only purpose was protecting queues from it); a policy-driven approach can replace it if rebalancing returns.

deltaSumTimestamp states are kept per queue, and merging them across queues silently returns wrong totals, on the dashboard and the public query API alike. Columns can now declare a mergeGroupKey, and the compiler rejects queries that merge such a column without grouping by that key or pinning it to a single value. The error names the column, explains the failure, and includes a corrected example query.
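The guard's rule (group by the key, or pin it to one value) can be expressed as a small check. Everything here is a hypothetical sketch: the real schema and compiler types are not shown in this PR text, and the column and queue names are illustrative.

```typescript
// Hypothetical schema and query shapes.
interface ColumnSchema {
  name: string;
  mergeGroupKey?: string; // e.g. "queue" for per-queue counter states
}

interface MergeQuery {
  mergedColumns: string[];          // columns passed to a -Merge aggregate
  groupBy: string[];                // GROUP BY column names
  pinnedTo: Record<string, string>; // columns fixed to a single value in WHERE
}

function mergeGuardErrors(schema: ColumnSchema[], query: MergeQuery): string[] {
  const errors: string[] = [];
  for (const column of schema) {
    const key = column.mergeGroupKey;
    if (key === undefined || query.mergedColumns.indexOf(column.name) === -1) continue;
    const grouped = query.groupBy.indexOf(key) !== -1;
    const pinned = Object.prototype.hasOwnProperty.call(query.pinnedTo, key);
    if (!grouped && !pinned) {
      errors.push(`column "${column.name}" must be merged per "${key}": group by it or pin it to one value`);
    }
  }
  return errors;
}

const schema: ColumnSchema[] = [{ name: "started", mergeGroupKey: "queue" }];

const acrossQueues = mergeGuardErrors(schema, { mergedColumns: ["started"], groupBy: ["t"], pinnedTo: {} });
const perQueue = mergeGuardErrors(schema, { mergedColumns: ["started"], groupBy: ["t", "queue"], pinnedTo: {} });
const singleQueue = mergeGuardErrors(schema, { mergedColumns: ["started"], groupBy: ["t"], pinnedTo: { queue: "emails" } });

console.log(acrossQueues.length, perQueue.length, singleQueue.length); // 1 0 0
```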

…e calls Short parameter lists on quantilesMerge and quantilesTDigestMerge do execute (the state layout is parameter independent, verified on both ClickHouse versions we run), but they rely on undocumented leniency and make the result-array indexes mean different quantiles per call site. Every merge now uses the stored four-quantile list with indexes re-pointed accordingly; returned values are unchanged.
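One way to keep indexes consistent across call sites is to derive them from the stored list. The sketch below assumes the four-quantile list is (0.5, 0.9, 0.95, 0.99), as it appears in the wait-pct widget query quoted in the review on this PR; `quantileExpr` is a hypothetical helper.

```typescript
const STORED_QUANTILES = [0.5, 0.9, 0.95, 0.99] as const;
type StoredQuantile = (typeof STORED_QUANTILES)[number];
type MergeFn = "quantilesMerge" | "quantilesTDigestMerge";

// ClickHouse arrays are 1-indexed, so p95 in this list is element 3.
function quantileIndex(q: StoredQuantile): number {
  return STORED_QUANTILES.indexOf(q) + 1;
}

// Always passes the full stored list, so an index means the same quantile everywhere.
function quantileExpr(mergeFn: MergeFn, column: string, q: StoredQuantile): string {
  return `${mergeFn}(${STORED_QUANTILES.join(", ")})(${column})[${quantileIndex(q)}]`;
}

console.log(quantileExpr("quantilesTDigestMerge", "wait_quantiles", 0.95));
// quantilesTDigestMerge(0.5, 0.9, 0.95, 0.99)(wait_quantiles)[3]
```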

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-07-05T12:52:06Z

+      },
+      "wait-pct": {
+        title: "Scheduling delay p50/p95/p99 (ms)",
+        query: `SELECT timeBucket() AS t,\n  round(quantilesTDigestMerge(0.5, 0.9, 0.95, 0.99)(wait_quantiles)[1]) AS p50,\n  round(quantilesTDigestMerge(0.5, 0.9, 0.95, 0.99)(wait_quantiles)[3]) AS p95,\n  round(quantilesTDigestMerge(0.5, 0.9, 0.95, 0.99)(wait_quantiles)[4]) AS p99\nFROM env_metrics\nGROUP BY t\nORDER BY t`,


🚩 Built-in dashboard wait-pct widget query uses quantilesTDigestMerge — verify it targets env_metrics, not queue_metrics

The wait-pct widget in the queues built-in dashboard (apps/webapp/app/presenters/v3/BuiltInDashboards.server.ts:715) uses quantilesTDigestMerge in its query. The queue_metrics_v1 ClickHouse table stores wait_quantiles as AggregateFunction(quantiles(...), UInt32) (regular quantiles), while env_metrics_v1 stores it as AggregateFunction(quantilesTDigest(...), UInt32). Using quantilesTDigestMerge on a quantiles state would fail at ClickHouse execution time. The query text is truncated in the diff so the FROM clause is not visible — if it targets queue_metrics, this is a runtime error; if it targets env_metrics, it's correct. The queue detail page at apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues_.$queueParam/route.tsx:151 correctly uses quantilesMerge for queue_metrics.

devin-ai-integration

Devin Review found 1 new potential issue.

devin-ai-integration · 2026-07-05T13:03:03Z

+const queuesDashboard: BuiltInDashboard = {
+  key: "queues",
+  title: "Queues",
+  filters: ["queues"],
+  layout: {


🚩 env_metrics widgets may receive queue filters they cannot apply

The queues dashboard (BuiltInDashboards.server.ts:556) declares filters: ["queues"], enabling a queue filter. The dashboard route passes queues={queues.length > 0 ? queues : undefined} to every MetricWidget. The executeQuery function adds queue: { op: "in", values: queues } to the enforced WHERE clause when queues are non-empty. However, env_metrics has no queue column — it's an environment-level rollup. Whether the TSQL printer silently skips enforced conditions for non-existent columns or errors depends on the printer's column resolution logic. In practice, the env_metrics widgets (env-used, env-limit, sat-time, used-limit) would either silently ignore the filter (correct — env metrics are queue-independent) or show an error in the widget. Worth verifying the printer's behavior with a quick test if queue filtering on the queues dashboard is expected to be used.

ericallam marked this pull request as ready for review July 3, 2026 10:26

ericallam force-pushed the feat/queue-metrics-and-health branch from a892684 to 9412bf5 Compare July 4, 2026 08:30

ericallam force-pushed the feat/queue-metrics-and-health branch from 6432d9f to 3c67a0c Compare July 4, 2026 22:16

ericallam closed this Jul 5, 2026

ericallam reopened this Jul 5, 2026

ericallam added 13 commits July 5, 2026 13:41

feat(metrics-pipeline): generic Redis-stream to ClickHouse metrics pi…

73efe53

feat(clickhouse): queue metrics tables and read queries

7c7e9f0

feat(run-engine): emit queue depth, throughput, and scheduling-delay …

dbd725e

feat(webapp): queue metrics ingestion, admin controls, and emission s…

761ebbe

feat(tsql): opt-in gap-fill for time-bucketed series

a3d88c1

feat(webapp): Queues dashboard and per-org metrics UI flag

e44e3ce

chore(webapp): add server-changes note for queue metrics

9036161

chore: apply oxfmt formatting

a9003c0

chore: use import type for type-only imports

118a18d

fix(tsql): avoid polynomial backtracking in ORDER BY direction strip

79660ad

fix(tsql): strip ORDER BY direction without a backtracking regex

83cb71d

fix(clickhouse): remove semicolons from queue metrics migration comments

f2f1921

test(clickhouse): rewrite queue metrics test for cumulative counters

f088d4a

ericallam added 24 commits July 5, 2026 13:42

test(run-engine): import describe from vitest in run-queue metrics test

7a0f14e

fix(tsql): register the deltaSumTimestampMerge aggregate

a2ab1dc

chore(webapp): use shared primitives on the admin queue metrics page

3131fc5

feat(clickhouse): queue activity ranking queries

42139c2

fix(webapp): include rollup tables in the queue metrics simulator reset

3991214

fix(webapp): keep the search filter on the ranked queue list's tail

5851eb7

ericallam force-pushed the feat/queue-metrics-and-health branch from 1416ee6 to cbe5444 Compare July 5, 2026 12:44

devin-ai-integration Bot reviewed Jul 5, 2026

fix(tsql): satisfy the tenant column type in the merge guard test schema

5511df5

devin-ai-integration Bot reviewed Jul 5, 2026
