feat(webapp,run-engine): queue metrics and health dashboard #4131
ericallam wants to merge 38 commits into
Walkthrough
This PR adds queue-metrics ingestion, storage, query, and UI support. It introduces a Redis/ClickHouse metrics pipeline package, ClickHouse queue-metrics tables and query helpers, run-queue emission hooks, gap-filling support in TSQL, and new webapp admin, dashboard, list, and detail routes. It also adds environment and feature-flag gating, seed tooling, and tests across the pipeline and query layers.
Related PRs: None found.
Suggested labels: enhancement, area: webapp, area: run-engine, area: internal-packages
Suggested reviewers: ericallam, matt-aitken
Pre-merge checks: 4 passed, 1 failed (1 warning)
…signals Gauges are read inside the enqueue/dequeue Lua and returned on the script reply as a 2-tuple; counters are cumulative odometers. The run-queue Redis carries no metrics stream of its own.
…counters entryOrderKey returns a string built with BigInt math so ordering stays correct at real epoch magnitudes. Odometer keys are namespaced by definition name. The consumer reports null lag for a missing consumer group instead of 0, and empty gauge values parse as NaN rather than 0.
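The null-versus-zero distinction in this commit can be sketched as below. This is a minimal illustration, not the code in the PR: `parseGauge` and `reportLag` are hypothetical names, and the only behavior taken from the commit message is that an empty gauge parses as NaN and a missing consumer group reports null lag.

```typescript
// Hypothetical helper: parse one gauge field from a stream entry.
// Number("") evaluates to 0 in JavaScript, so the empty case needs its own branch,
// otherwise a missing reading would look like a genuine reading of zero.
function parseGauge(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") {
    return NaN; // missing reading, not zero
  }
  return Number(raw);
}

// Hypothetical helper: consumer lag. When the consumer group does not exist the lag
// is unknown, so it is reported as null rather than as 0.
function reportLag(groupExists: boolean, pendingEntries: number): number | null {
  return groupExists ? pendingEntries : null;
}
```

Keeping "unknown" distinct from "zero" lets downstream charts show a gap instead of a misleading flat line at 0.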
…ng order keys The wait-time quantile materialized view now excludes wait_ms = 0 rows so it matches the count aggregation. order_key accepts a string or a number. Migration comments no longer contain semicolons that split the migration into invalid statements.
…rride The queues list tolerates a metrics query failure by rendering without metrics and logging a warning. UsageSparkline renders its total override even when every bucket is zero. The queue detail page returns 404 and its loader skips the metrics query when the feature flag is off. The seed script validates bucket size and only writes ClickHouse against a local host.
A bucket-led ORDER BY DESC combined with fillGaps emitted an ascending WITH FILL (positive step, ascending bounds), which produces invalid or empty fills. Skip the gap-fill rewrite for descending orders and let the plain descending query stand. Adds a DESC fillGaps test.
Packs the stream sequence with a 1e6 factor (was 1e5) so up to 1M entries per millisecond per shard fit before a seq could spill into the next millisecond's range, far above what a single Redis stream can produce. ms*1e6 stays within UInt64. Also fixes the webapp mapping test that still expected a numeric order_key after the switch to a BigInt-derived string.
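A rough sketch of the packing arithmetic, assuming the order key is derived from a Redis stream ID of the form `<ms>-<seq>`. The function name `sketchOrderKey` is hypothetical; only the 1e6 factor, the BigInt math, and the string return type come from the commit messages.

```typescript
// Factor from the commit message: up to 1,000,000 entries per millisecond.
const SEQ_FACTOR = BigInt(1000000);

// Hypothetical helper: pack "<ms>-<seq>" into a single decimal string.
// BigInt keeps every digit; a JavaScript number cannot, because ms * 1e6 at
// current epoch values (about 1.7e18) is far above Number.MAX_SAFE_INTEGER.
function sketchOrderKey(streamId: string): string {
  const [ms, seq = "0"] = streamId.split("-");
  return (BigInt(ms) * SEQ_FACTOR + BigInt(seq)).toString();
}

const key = sketchOrderKey("1700000000000-5");
// Floating-point math silently drops the sequence at this magnitude:
const lossy = 1700000000000 * 1e6 + 5 === 1700000000000 * 1e6;
console.log(key, lossy);
```

Returning a string avoids precision loss when the value crosses JSON or driver boundaries on its way to a UInt64 column.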
The queues list and queue detail pages now use the shared TimeFilter (any preset period or a custom date range) and everything on the page follows it: header tiles, per queue metric columns, charts, and stats. The custom period buttons, hand rolled chart cards, and duplicated metric fetch loops are replaced by the ChartCard and Chart primitives, UsageSparkline, and a shared useMetricResourceQuery hook. The ClickHouse list queries take an explicit end bound so fixed ranges query only their window.
Queries using deltaSumTimestampMerge failed with an unknown function error, which broke the queue detail stats and the started counts on the built in Queues dashboard.
The queues list header tiles now render the same line chart, grid, and tooltip as the rest of the metrics charts instead of a row sparkline, with the headline value in the tile header. The env saturation tile draws the environment concurrency limit and burst limit as labeled reference lines. Chart tooltips keep a gap between the series label and the value, and the shared line chart gains showDots and referenceLines options.
Adds an Allocation tab to the Queues page (behind the queue metrics UI flag): overview cards, a burst-aware capacity bar showing each queue allocation and its live usage in a distinct color, an inline-editable limits table with per-queue locks, load-weighted auto-balance, and a review dialog that bulk-applies limits as overrides through the existing concurrency system. The queue list now defaults to Busiest ordering (with Backlog and Name options). ClickHouse ranks queues by activity over the last 15 minutes and returns just the requested page of names, so the cost per page is one small aggregate regardless of environment size; idle queues follow in name order and any failure falls back to name ordering. The classic page keeps plain name order.
The fallback WHERE injection only targeted the top-level SELECT, so a query shaped as an outer aggregation over a FROM subquery failed to compile: the time column only exists inside the subquery. Descend into the subquery so the fallback lands next to the table reference.
Adds two rollups fed from the raw landing table: a per-queue 5-minute tier and an environment-level 1-minute tier (gauges plus TDigest wait quantiles). Ranking now reads the 5m tier and returns the page and the ranked total in one windowed query instead of two scans. The 5m materialized view reads raw rather than cascading off the 10s table: deltaSumTimestamp states hold a single first/last segment, so merging states in an MV's hash-ordered GROUP BY double-counts bridging spans. For the same reason the env tier carries no counter columns, and env-wide counter totals must group by queue before summing.
The built-in queues dashboard's enqueued vs started chart merged counter states across queues, which mixes unrelated cumulative counters and returns wrong totals; it now merges per queue and sums outside. Env header tiles and saturation charts read the environment rollup, so their cost no longer scales with queue count, and coarse-bucket ranges are served from the 5m rollup automatically. Queue list ranking runs as one query, time bounds are aligned to the bucket grid, and repeated auto-refresh reads share ClickHouse query-cache entries.
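To see why merging counter states across queues gives wrong totals, here is a simplified model over raw readings. It assumes delta-sum semantics of "order by timestamp, add positive differences, ignore negative ones"; the real aggregate states differ in detail, and the queue names and values are made up for illustration.

```typescript
interface Reading {
  queue: string;
  ts: number;
  value: number; // cumulative odometer reading for this queue
}

// Simplified delta sum: positive differences between consecutive readings by time.
function deltaSum(readings: Reading[]): number {
  const sorted = [...readings].sort((a, b) => a.ts - b.ts);
  let total = 0;
  for (let i = 1; i < sorted.length; i++) {
    const diff = sorted[i].value - sorted[i - 1].value;
    if (diff > 0) total += diff;
  }
  return total;
}

// Correct shape: one delta sum per queue, then add the per-queue results.
function sumPerQueue(readings: Reading[]): number {
  const byQueue = new Map<string, Reading[]>();
  for (const r of readings) {
    const group = byQueue.get(r.queue) ?? [];
    group.push(r);
    byQueue.set(r.queue, group);
  }
  let total = 0;
  byQueue.forEach((group) => {
    total += deltaSum(group);
  });
  return total;
}

const readings: Reading[] = [
  { queue: "a", ts: 1, value: 100 },
  { queue: "b", ts: 2, value: 5 },
  { queue: "a", ts: 3, value: 110 },
  { queue: "b", ts: 4, value: 8 },
];
console.log(deltaSum(readings), sumPerQueue(readings));
```

In this model the merged sum treats the jump from queue b's 5 to queue a's 110 as 105 new events, while the per-queue sum recovers the real 10 + 3 = 13.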
… rollup The env rollup's win comes from dropping the queue dimension, not from coarser buckets: row count is queue-independent (~8640/day/env), so full 10-second granularity stays cheap at any range. Env header tiles and saturation charts now resolve short-range detail exactly like the per-queue charts, and the current-value tiles read the latest 10-second bucket instead of a minute-wide one.
The simulator's --reset only cleared the raw and 10s tables, leaving stale rows in the 5m and env rollups. It also force-merges the rollups after seeding so current-value widgets read cleanly.
Counter events now emit per queue and op odometer readings with a seeded zero baseline, matching the production emitter, so throughput and started counts reconstruct from simulated data instead of reading zero. Scenario switches prune the previous scenario's queues, a --project flag seeds each scenario into its own project for side-by-side design review, and a new many-queues scenario covers pagination and relevance ranking with one runaway queue, a busy head, a bursty middle, and a sparse tail. Adds --help.
A --usage flag stages plausible running counts in the local run-queue Redis for the seeded queues, so the list's Running column and the Allocation tab's usage bars have data without the run engine. Staged state is reconciled on every run: present with --usage, cleared without. Local Redis hosts only.
The tail query's exclusion list overwrote the search's name filter via object spread, so searching while sorted by activity showed unrelated queues past the ranked head. Combine the conditions with AND instead.
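The overwrite is easy to reproduce in isolation. The sketch below assumes a Prisma-style filter shape in which both conditions are keyed on `name`; the type, the `matches` helper, and the queue names are illustrative, not taken from the PR.

```typescript
// Assumed filter shape: both the search and the exclusion list constrain `name`.
type NameFilter = { name: { contains?: string; notIn?: string[] } };

const search: NameFilter = { name: { contains: "email" } };
const exclusion: NameFilter = { name: { notIn: ["email-high"] } };

// Bug: object spread keeps only the last `name` key, so the search term is lost.
const overwritten = { ...search, ...exclusion };

// Fix: keep both conditions and require each of them to hold.
const combined = { AND: [search, exclusion] };

// Tiny evaluator so the difference is observable without a database.
function matches(filter: NameFilter, name: string): boolean {
  const { contains, notIn } = filter.name;
  if (contains !== undefined && !name.includes(contains)) return false;
  if (notIn !== undefined && notIn.includes(name)) return false;
  return true;
}

const matchesAll = (name: string) => combined.AND.every((f) => matches(f, name));
console.log(matches(overwritten, "billing"), matchesAll("billing"));
```

With the spread version an unrelated queue such as `billing` passes the filter; with the AND version it is rejected because it fails the search condition.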
…ot ready Without a readiness guard, every fire-and-forget emit during a metrics Redis outage queued a command in ioredis's in-memory offline queue until rejection. Metrics are loss-tolerant by design, so drop instead; waitUntilReady() lets embedders await the initial connect.
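One way to picture the guard, as a sketch rather than the emitter in this PR: check the client's connection status before each fire-and-forget write and count a drop otherwise. `LossTolerantEmitter` and `MetricsRedisLike` are hypothetical names; the interface mirrors the `status` property and `xadd` command that ioredis clients expose.

```typescript
// Minimal slice of a Redis client that this sketch needs.
// ioredis reports "ready" in `status` once the connection is usable.
interface MetricsRedisLike {
  status: string;
  xadd(key: string, id: string, ...fieldsAndValues: string[]): Promise<unknown>;
}

// Hypothetical emitter: drops readings instead of queueing them while Redis is not ready.
class LossTolerantEmitter {
  dropped = 0;
  private redis: MetricsRedisLike;
  private streamKey: string;

  constructor(redis: MetricsRedisLike, streamKey: string) {
    this.redis = redis;
    this.streamKey = streamKey;
  }

  emit(fields: Record<string, string>): void {
    if (this.redis.status !== "ready") {
      this.dropped += 1; // loss-tolerant by design: no in-memory backlog builds up
      return;
    }
    const flat: string[] = [];
    for (const key of Object.keys(fields)) {
      flat.push(key, fields[key]);
    }
    // Fire and forget: a failed write only increments the drop counter.
    this.redis.xadd(this.streamKey, "*", ...flat).catch(() => {
      this.dropped += 1;
    });
  }
}
```

Because counters are cumulative, a dropped reading here costs resolution for that interval rather than correctness of the totals.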
The allocation view keeps manual limit edits, the review dialog, and bulk apply. The one-shot auto-balance button is removed (and the row locks whose only purpose was protecting queues from it); a policy-driven approach can replace it if rebalancing returns.
deltaSumTimestamp states are kept per queue, and merging them across queues silently returns wrong totals, on the dashboard and the public query API alike. Columns can now declare a mergeGroupKey, and the compiler rejects queries that merge such a column without grouping by that key or pinning it to a single value. The error names the column, explains the failure, and includes a corrected example query.
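The rule can be expressed as a small check. This is a sketch under assumptions: only `mergeGroupKey` is a name from the PR; the query shape, the helper name, and the column `started_state` are hypothetical stand-ins for whatever the compiler actually inspects.

```typescript
interface ColumnDef {
  name: string;
  mergeGroupKey?: string; // e.g. "queue" for per-queue counter states
}

// Hypothetical, simplified view of a compiled query.
interface MergeQueryShape {
  mergedColumns: string[]; // columns passed to a merge aggregate
  groupBy: string[]; // columns in GROUP BY
  pinned: Record<string, string[]>; // column -> values allowed by the WHERE clause
}

// Returns one error per column that is merged without grouping by its key
// or pinning that key to exactly one value.
function findUnsafeMerges(columns: ColumnDef[], query: MergeQueryShape): string[] {
  const errors: string[] = [];
  for (const name of query.mergedColumns) {
    const key = columns.find((c) => c.name === name)?.mergeGroupKey;
    if (!key) continue;
    const grouped = query.groupBy.includes(key);
    const values = query.pinned[key];
    const pinnedToOne = values !== undefined && values.length === 1;
    if (!grouped && !pinnedToOne) {
      errors.push(`Column "${name}" must be merged per "${key}": group by it or filter to one value.`);
    }
  }
  return errors;
}

const cols: ColumnDef[] = [{ name: "started_state", mergeGroupKey: "queue" }];
const unsafe = findUnsafeMerges(cols, { mergedColumns: ["started_state"], groupBy: ["t"], pinned: {} });
console.log(unsafe);
```

Rejecting at compile time turns a silently wrong total into an actionable error for both the dashboard and the public query API.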
…e calls Short parameter lists on quantilesMerge and quantilesTDigestMerge do execute (the state layout is parameter independent, verified on both ClickHouse versions we run), but they rely on undocumented leniency and make the result-array indexes mean different quantiles per call site. Every merge now uses the stored four-quantile list with indexes re-pointed accordingly; returned values are unchanged.
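A small helper makes the index bookkeeping explicit. The sketch assumes the stored list is `0.5, 0.9, 0.95, 0.99`, as it appears in the `env_metrics` wait-pct query in this PR; `quantileIndex` and `mergeExpr` are hypothetical names, and the list stored for other tables may differ.

```typescript
// Assumed stored quantile levels, in the order used by the env_metrics query.
const STORED_LEVELS: readonly number[] = [0.5, 0.9, 0.95, 0.99];

// ClickHouse arrays are 1-based, so the second stored level is index 2, and so on.
function quantileIndex(level: number): number {
  const i = STORED_LEVELS.indexOf(level);
  if (i === -1) {
    throw new Error(`Quantile ${level} is not in the stored list`);
  }
  return i + 1;
}

// Builds a merge expression that always passes the full stored list,
// so a given index means the same quantile at every call site.
function mergeExpr(fn: "quantilesMerge" | "quantilesTDigestMerge", level: number): string {
  return `${fn}(${STORED_LEVELS.join(", ")})(wait_quantiles)[${quantileIndex(level)}]`;
}

console.log(mergeExpr("quantilesTDigestMerge", 0.95));
```

Generating the expression from one list removes the chance that a call site passes a shorter list and reads a different quantile from the same index.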
},
"wait-pct": {
  title: "Scheduling delay p50/p95/p99 (ms)",
  query: `SELECT timeBucket() AS t,\n round(quantilesTDigestMerge(0.5, 0.9, 0.95, 0.99)(wait_quantiles)[1]) AS p50,\n round(quantilesTDigestMerge(0.5, 0.9, 0.95, 0.99)(wait_quantiles)[3]) AS p95,\n round(quantilesTDigestMerge(0.5, 0.9, 0.95, 0.99)(wait_quantiles)[4]) AS p99\nFROM env_metrics\nGROUP BY t\nORDER BY t`,
🚩 Built-in dashboard wait-pct widget query uses quantilesTDigestMerge — verify it targets env_metrics, not queue_metrics
The wait-pct widget in the queues built-in dashboard (apps/webapp/app/presenters/v3/BuiltInDashboards.server.ts:715) uses quantilesTDigestMerge in its query. The queue_metrics_v1 ClickHouse table stores wait_quantiles as AggregateFunction(quantiles(...), UInt32) (regular quantiles), while env_metrics_v1 stores it as AggregateFunction(quantilesTDigest(...), UInt32). Using quantilesTDigestMerge on a quantiles state would fail at ClickHouse execution time. The query text is truncated in the diff so the FROM clause is not visible — if it targets queue_metrics, this is a runtime error; if it targets env_metrics, it's correct. The queue detail page at apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues_.$queueParam/route.tsx:151 correctly uses quantilesMerge for queue_metrics.
const queuesDashboard: BuiltInDashboard = {
  key: "queues",
  title: "Queues",
  filters: ["queues"],
  layout: {
🚩 env_metrics widgets may receive queue filters they cannot apply
The queues dashboard (BuiltInDashboards.server.ts:556) declares filters: ["queues"], enabling a queue filter. The dashboard route passes queues={queues.length > 0 ? queues : undefined} to every MetricWidget. The executeQuery function adds queue: { op: "in", values: queues } to the enforced WHERE clause when queues are non-empty. However, env_metrics has no queue column — it's an environment-level rollup. Whether the TSQL printer silently skips enforced conditions for non-existent columns or errors depends on the printer's column resolution logic. In practice, the env_metrics widgets (env-used, env-limit, sat-time, used-limit) would either silently ignore the filter (correct — env metrics are queue-independent) or show an error in the widget. Worth verifying the printer's behavior with a quick test if queue filtering on the queues dashboard is expected to be used.
Summary
Adds per-queue observability to the Queues page: depth (backlog), throughput (enqueued, started, completed), concurrency, whether a queue is throttled, and the scheduling delay (how long runs wait between becoming eligible and actually starting). Each queue shows health at a glance in the list, plus a per-queue detail page with charts, so you can answer "does this queue have enough concurrency to keep up?".
Both the data collection and the dashboard are off by default and gated independently: metric emission is a global switch, and the dashboard is turned on per organization. With both off, the Queues page is unchanged.
Design
Queue operations emit two kinds of signal. Gauges (depth, running, limit, throttled) are read inside the same Redis script that performs the enqueue or dequeue, so the reading is atomic, and returned on the script's reply for the app to forward. Counters (enqueued, started, completed) are cumulative odometers, so a dropped reading self-heals: the next one restates the running total. Both land on one Redis stream on a dedicated metrics instance (falling back to the run queue's Redis when self-hosting), drain through a consumer into ClickHouse (raw, a 10-second-bucket materialized view, and a 30-day aggregate), and the dashboards read the aggregate. The run queue's own Redis carries no metrics stream.
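The self-healing property of odometer counters can be shown with plain arithmetic. This is an illustrative sketch with made-up numbers, assuming a counter that never resets and starts from a zero baseline; the function names are hypothetical.

```typescript
// Delta events: each message carries "N new events". A lost message loses N forever.
function totalFromDeltas(delivered: number[]): number {
  return delivered.reduce((sum, n) => sum + n, 0);
}

// Odometer readings: each message restates the running total.
// Assuming no resets, the total over a window is the last reading minus the first.
function totalFromOdometer(delivered: number[]): number {
  if (delivered.length < 2) return 0;
  return delivered[delivered.length - 1] - delivered[0];
}

// True activity: 3, then 4, then 5 events. Odometer readings: 0 (baseline), 3, 7, 12.
// Suppose the middle message is dropped in both schemes.
const deltasDelivered = [3, 5]; // the "4" was lost
const odometerDelivered = [0, 3, 12]; // the "7" was lost

console.log(totalFromDeltas(deltasDelivered), totalFromOdometer(odometerDelivered));
```

The delta scheme undercounts by the lost message, while the odometer scheme still recovers the full 12 because the next reading restates the running total; what is lost is only the shape of the curve between the surviving readings.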
The one change that is live the moment this deploys, independent of both flags, is the enqueue/dequeue script reply shape: those scripts now return a 2-tuple so the gauge reading can ride back to the app. That path is exercised on every queue op, so it is the part of run-engine worth the closest review.