Skip to content

#313 P1+P3: replace the boot-time facet-index full scan with a tiny trusted manifest#317

Merged
rdhyee merged 6 commits into
isamplesorg:mainfrom
rdhyee:fix/313-facet-index-manifest-p1-p3
Jul 2, 2026
Merged

#313 P1+P3: replace the boot-time facet-index full scan with a tiny trusted manifest#317
rdhyee merged 6 commits into
isamplesorg:mainfrom
rdhyee:fix/313-facet-index-manifest-p1-p3

Conversation

@rdhyee

@rdhyee rdhyee commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

What this fixes

Part of #313 (Explorer slowness on slow connections). Live repro today: a URL with a preset facet filter at continental zoom took ~45-50 seconds to fully resolve on current production — reproduced independent of any recent changes, so this is a pre-existing latency issue, not a regression.

facetIndexReady in explorer.qmd currently does two expensive things against the live sample_facet_index.parquet (9.68 MB, ~6M rows) on every page load, blocking multi-filter count readiness:

  1. SELECT DISTINCT build_id, schema_version FROM read_parquet(index_url) — touches build_id/schema_version columns across every row group of the 9.68 MB file.
  2. A coverage check: SELECT source, COUNT(*) FROM read_parquet(index_url) GROUP BY source vs facet_summaries — a full 6M-row scan.

This PR eliminates both, per the joint Claude+Codex mitigation plan from the original 2026-06-26 investigation (P0, the "Loading…" honesty-state fix, already shipped as #316).

P1 — trusted build-time manifest

  • New sample_facet_index_meta.parquet artifact (scripts/build_frontend_derived.py): a tiny (~1 KB) per-source histogram + build_id/schema_version/total_rows, computed directly from samp_geo — the same authoritative table sample_facet_index itself derives from, not read back from the index (independence is the point: a buggy index build could carry self-consistent-but-wrong metadata).
  • Independent validation gate (scripts/validate_frontend_derived.py): reads the actual on-disk sample_facet_index.parquet (full scan — fine at build/CI time, never the browser critical path) and asserts it matches the manifest.
  • explorer.qmd's facetIndexReady now reads the tiny manifest instead of scanning the big index. Same checks (schema version, node_bits generation match, coverage vs facet_summaries), same data, just a cheaper source. The big index is now touched only lazily, when a user's actual multi-filter query runs.
  • Escape hatch: --only sample_facet_index_meta builds just the meta file without forcing a full index rebuild — for pairing a new meta file with an already-deployed index built from the same input (see deployment note below).

P3 — decouple the masks scan from the readiness gate

facetIndexReady previously waited on the entire nodeBitsReady cell, including a 9.67 MB masks scan it doesn't actually need (only __nodeBitsBuild, set after a 2 KB fetch). Split into nodeBitsCoreReady (fast) + nodeBitsReady (masks scan, now sequenced to run after facetIndexReady settles so the two don't contend for the single DuckDB-WASM connection).

P6 (targeted) — Firefox regression spec

Narrow firefox-facet-index-meta Playwright project, scoped to one new spec proving the pending→failed→ready UI contract and that a held/blocked manifest fetch never produces a permanent-looking stuck state.

Verification

  • 46/46 JS unit tests, 39/39 Python pipeline tests, explorer-smoke (chromium), and the new Firefox spec (3/3 clean runs) all pass.
  • Confirmed the new manifest pairs with the already-deployed index with zero risk: built it locally from a wide.parquet whose sha256 byte-matches what produced the currently-live sample_facet_index.parquet, and verified via the live public URL that the build_ids are identical.
  • Ran the new independent validator locally against the real deployed index — all new checks pass.
  • Two rounds of Codex review: design-level (conditional LGTM, 3 required corrections, all applied) and a final code-review pass on this diff (clean LGTM, no blocking issues).

⚠️ Deployment note — one manual step required

No R2 write access was available while building this, so the new sample_facet_index_meta.parquet file has not been uploaded. Built locally at /Users/raymondyee/Data/iSample/pqg_refining/staged_202608/p1_meta_local/isamples_202608_sample_facet_index_meta.parquet (also verified reproducible independently, sha256-matched inputs).

This is safe to merge before the upload happens: today, facetIndexReady always ends in 'failed' (no index at all reachable in a useful way for this check). After merging but before uploading the new file, it will still end in 'failed', just via a fast 404 instead of a slow scan — a net improvement to the failure path with zero behavior change to the success path (which simply isn't reachable yet either way). It becomes fully active (fast 'ready' path) the moment isamples_202608_sample_facet_index_meta.parquet is uploaded to R2 (isamples-ry bucket) alongside the existing isamples_202608_sample_facet_index.parquet — same build_id, confirmed paired above.

Relates to #313 (not closing — P1+P3 shipped; P2 DuckDB-WASM upgrade and P4/P5 remain deferred per the original review).

🤖 Generated with Claude Code

rdhyee and others added 6 commits July 1, 2026 15:43
…st, derived from samp_geo)

New build_sample_facet_index_meta() computes the per-source histogram directly
from samp_geo (the same authoritative located-universe table
build_sample_facet_index/build_facet_summaries already derive from), NOT by
reading back sample_facet_index.parquet itself -- independence is the point,
per Codex's 2026-07-01 review: an independent validator can then read the
actual on-disk index and prove meta/index/facet_summaries agree.

Registered in ARTIFACTS/HIER_ARTIFACTS, deliberately excluded from force_deps
so `--only sample_facet_index_meta` alone builds just the meta file -- the
escape hatch for pairing a new meta with an already-deployed index built from
the same wide input.

Part of isamplesorg#313 P1+P3 (facetIndexReady latency fix); validator + explorer.qmd
wiring + P3 decoupling + P6 targeted test to follow in this branch.
…ainst the real index

New --index-meta gate in validate_frontend_derived.py: schema/shape checks,
then (given --index) a FRESH full scan of the actual on-disk sample_facet_index
recomputes the per-source histogram/build_id/schema_version/row_count and
diffs it against the manifest via symmetric EXCEPT (relational content, not
byte identity) -- this is the independence Codex's review required: the
validator does not trust meta's self-reported numbers or read meta back to
derive its own expectation. Also cross-checks meta against facet_summaries'
source facet, mirroring the comparison the explorer runtime performs.

Continues isamplesorg#313 P1+P3 (see prior commit).
…ntract

Adds SERIALIZATIONS.md §4.13 and a DATA_PROVENANCE.md summary line for
the new manifest artifact: independence from sample_facet_index (built
from samp_geo, not read back), the --only escape hatch, and the R2
same-build_id pairing requirement.
…decouple masks scan

P1: facetIndexReady now reads index_meta_url (a few KB, built at compile time
from samp_geo and independently validated against the real index) instead of
scanning the 9.68MB sample_facet_index.parquet directly. Same checks (schema
version, node_bits generation match, per-source coverage vs facet_summaries),
same data, just sourced from the cheap pre-verified manifest. The big index
file is now touched only lazily, when a user's actual multi-filter count
query runs -- never during the readiness check.

P3: split nodeBitsReady into nodeBitsCoreReady (step 1, node_bits fetch,
publishes __nodeBitsMap/__nodeBitsBuild) and a thinner nodeBitsReady (step 2,
the 9.67MB masks scan). facetIndexReady now depends on nodeBitsCoreReady only
-- previously it depended on the whole nodeBitsReady cell, which meant it
couldn't even start until the masks scan finished, even though the values it
needs are published synchronously before that scan begins. nodeBitsReady
itself now awaits facetIndexReady's settlement (ready or failed, either is
fine) before starting the masks scan, so the two don't race for the single
DuckDB-WASM connection -- same discipline as whenConnectionIdle elsewhere in
this file.

Completes the explorer.qmd side of isamplesorg#313 P1+P3 (see prior two commits for the
data-pipeline side: build_frontend_derived.py + validate_frontend_derived.py).
…ding/failed race

Adds a narrow firefox-facet-index-meta Playwright project scoped to ONE new
spec (tests/playwright/facet-index-meta-pending.spec.js), not a broad Firefox
enable. Test 1 uses page.route() to hold/release the sample_facet_index_meta
fetch and proves window.__facetIndexStatus stays 'pending' while held and
settles (ready/failed) once released. Test 2 exercises the exact UI contract
for 2 active Material filters at global view across pending -> failed ->
ready, reusing the real production handleFacetFilterChange/
updateCrossFilteredCounts code path.

Empirical finding baked into the design (documented in the spec's header):
DuckDB-WASM's non-threaded worker serializes queries, so holding the meta
fetch open also starves the Material facet's own independent query -- a
real held request and "Material checkboxes interactive" can't coexist in a
single fresh page load. Test 2 therefore drives window.__facetIndexStatus
directly (the same global the real preflight sets) after a normal boot,
which lets it assert the pending/failed contract deterministically and still
trigger a REAL count query for the 'ready' step (sample_facet_index and
facet_node_bits are already live on R2; only the new meta manifest isn't).
That real query was confirmed to genuinely start against production but did
not resolve within the spec's window in this sandboxed environment (a large,
network-bound full-file read) -- so the 'ready' step is a best-effort/soft
check, not a hard CI assertion, with the reasoning documented inline.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01XEtSoXjsKtnYWQ7yS8mGRo
…ntract spec

Test 2 (pending -> failed -> ready UI contract) failed on repeat local runs:
the DOM was still showing the "(Loading…)" pending state when the test
expected "(—)" failed, well past the original 45s poll window.

Tried and reverted: blocking the real sample_facet_index_meta fetch to
"neutralize" the real boot-time preflight racing the test's manual
window.__facetIndexStatus injections. That reintroduces the exact FIFO
single-worker starvation the spec's own DESIGN NOTE documents -- Material's
facet_tree_summaries query gets stuck behind the held route on the same
DuckDB-WASM worker, so the checkboxes this test needs never render at all.

Root cause is more likely general single-worker query-queue congestion in
this sandbox's network path to data.isamples.org (the same Firefox slowness
already documented for the 'ready' step) occasionally delaying the
pending->failed repaint past 45s, not a status race -- the real preflight
resolves to 'failed' quickly (a 404, not a large download) well before this
test's manual steps run.

Fix: generous-but-bounded timeouts (45s -> 90s) on both the pending and
failed polls, test.setTimeout 180s -> 300s to give them room. Verified 3/3
clean runs locally after the change (previously flaked on run 2 of 2).

Also verified independently: 46/46 unit tests, 39/39 python pipeline tests,
explorer-smoke (chromium) all still pass.
@rdhyee rdhyee merged commit 97a7fb5 into isamplesorg:main Jul 2, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant