Add collection facet to explorer (e.g. OpenContext PKAP) (#243) #244

Draft
rdhyee wants to merge 1 commit into isamplesorg:main from rdhyee:feat/collection-facet

@rdhyee rdhyee commented May 29, 2026

Contributor

Resolves #243.

Adds a first-class collection dimension to the explorer: filter to a named
SamplingSite label (e.g. the OpenContext project "PKAP Survey Area") and
layer the existing material / context / object_type facets on top.

Why this design (additive)

"Collection" identity lives on SamplingSite entities, which are reached only
through the MaterialSampleRecord → produced_by → SamplingEvent → sampling_site
→ SamplingSite traversal and never appear on the sample rows the explorer
renders. Doing that array-join live in DuckDB-WASM is the documented in-browser
bottleneck, so membership is precomputed. The build pipeline for the current
sample_facets_v2 / facet_summaries / facet_cross_filter files isn't in any
repo, so rather than risk regenerating those, this feature is strictly
additive. It adds two new files that touch nothing existing:

  • collections.parquet — dimension (collection_id, label, source, n_samples, centroid_lat/lng, bbox). 61,695 rows, ~3 MB. Powers the top-N
    checkboxes, the search box, and the Featured-Collections preset cameras.
  • sample_collections.parquet — membership (pid → collection_id). ~13 MB.
    The filter appends a second pid IN (SELECT … ) subquery in
    facetFilterSQL(), exactly parallel to the existing facet predicate.
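
The precompute behind these two files can be sketched in plain Python. This is only an illustration of the Sample → Event → Site traversal on in-memory dicts, not the code in scripts/build_collections.py (which works on the wide parquet). The dict shapes, the function name `build_membership`, and the source value `"OPENCONTEXT"` are assumptions made for the example.

```python
from collections import defaultdict


def build_membership(samples, events, sites):
    """Illustrative Sample -> Event -> Site traversal (hypothetical shapes).

    samples: {pid: [event_id, ...]}       # stands in for produced_by
    events:  {event_id: [site_id, ...]}   # stands in for sampling_site
    sites:   {site_id: (source, label)}   # collection key is (source, label)
    """
    membership = set()  # a set keeps (pid, collection key) pairs distinct
    for pid, event_ids in samples.items():
        for event_id in event_ids:                    # unnest the first array
            for site_id in events.get(event_id, []):  # unnest the second array
                if site_id in sites:
                    membership.add((pid, sites[site_id]))

    # n_samples per collection counts distinct pids, not site rows.
    n_samples = defaultdict(int)
    for _pid, key in membership:
        n_samples[key] += 1

    # Order rows by collection key, then pid, so one collection's rows sit together.
    rows = sorted(membership, key=lambda row: (row[1], row[0]))
    return rows, dict(n_samples)


# Two site rows share one label; sample "s1" reaches both but is counted once.
samples = {"s1": ["e1", "e2"], "s2": ["e2"], "s3": ["e3"]}
events = {"e1": ["siteA"], "e2": ["siteB"], "e3": ["siteC"]}
sites = {
    "siteA": ("OPENCONTEXT", "PKAP Survey Area"),
    "siteB": ("OPENCONTEXT", "PKAP Survey Area"),
    "siteC": ("OPENCONTEXT", "Other Site"),
}
rows, counts = build_membership(samples, events, sites)
```

Deduplicating on (pid, collection) before counting is what keeps a sample that reaches the same label through several events or sites from inflating n_samples.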

A "collection" = a SamplingSite label (≈1,336 site rows share "PKAP Survey
Area"), keyed by a stable hash of (source, label). Verified: PKAP = 15,446
samples.
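
One way to derive a stable key from (source, label) is sketched below. The PR does not state which hash function or encoding the build script uses, so the SHA-256 digest, the unit-separator delimiter, and the 16-hex-character truncation here are all assumptions. This sketch is not expected to reproduce real ids such as dd74c71982da0e21.

```python
import hashlib


def collection_id(source: str, label: str) -> str:
    """Hypothetical stable id for a (source, label) pair.

    ASSUMPTION: SHA-256 over "source<US>label", truncated to 16 hex chars.
    The actual scheme in scripts/build_collections.py may differ.
    """
    # A delimiter that cannot appear in normal text keeps ("ab", "c") and
    # ("a", "bc") from hashing to the same input bytes.
    key = f"{source}\x1f{label}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


pkap_id = collection_id("OPENCONTEXT", "PKAP Survey Area")
```

Because the id depends only on the two strings, every site row sharing a label maps to the same collection, and the id stays the same across rebuilds as long as the label text is unchanged.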

What's in the PR

  • scripts/build_collections.py — builds both files from /current/wide.parquet.
  • explorer.qmd — dual-UX collection facet (top-N checkboxes + search-the-tail
    for the ~60K long tail), ?collection= URL param wired through the existing
    facet lifecycle (applyQueryToFacetFilters / writeQueryState /
    handleFacetFilterChange) and the facetFilterSQL() chokepoint.
  • collections.qmd — Featured Collections page upgraded to identity-based
    &collection=<id> links + camera fly.
  • EXPLORER_STATE.md, data.qmd — document the new param and files.
  • tests/test_collections.py — Collections page + explorer facet-DOM checks.
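
The shape of the second subquery added at the facetFilterSQL() chokepoint can be sketched as follows. The real function lives in explorer.qmd as JavaScript; this Python version only illustrates the string composition. The helper names, the placeholder URL, and the exact SQL text are assumptions. `read_parquet` is the DuckDB table function, and the pid / collection_id columns come from the file description above.

```python
def sql_quote(value: str) -> str:
    """Quote a string literal for SQL by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def collection_predicate(collection_ids, membership_url):
    """Build the membership subquery, or "" when no collection is selected."""
    if not collection_ids:
        return ""
    id_list = ", ".join(sql_quote(c) for c in collection_ids)
    return (
        f"pid IN (SELECT pid FROM read_parquet({sql_quote(membership_url)}) "
        f"WHERE collection_id IN ({id_list}))"
    )


def facet_filter_sql(facet_predicate, collection_ids, membership_url):
    """AND the existing facet predicate with the collection predicate."""
    parts = [
        facet_predicate,
        collection_predicate(collection_ids, membership_url),
    ]
    return " AND ".join(p for p in parts if p)


# Placeholder URL for illustration only; the deployed path is not shown here.
MEMBERSHIP_URL = "https://example.invalid/sample_collections.parquet"
where = facet_filter_sql("", ["dd74c71982da0e21"], MEMBERSHIP_URL)
```

Keeping the collection filter as its own `pid IN (...)` clause means it composes with the material / context / object_type predicate through a plain AND, without touching how that predicate is built.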

⚠️ Merge gate — requires R2 upload first

The facet is inert until the two files are live on data.isamples.org:

  • Run python scripts/build_collections.py --out-dir <dir> --snapshot 202604
  • Upload isamples_202604_collections.parquet + isamples_202604_sample_collections.parquet to R2 (behind the data.isamples.org Worker)
  • Verify live: open explorer.html?collection=dd74c71982da0e21 → PKAP samples; layer a material facet to confirm it narrows
  • Run tests/test_collections.py against the deployed site
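
A small helper for the checklist above, as a sketch: it derives the two expected file names from the snapshot tag and builds the verification link. The file-name pattern and the ?collection= parameter are taken from this description. The function names and the base URL are assumptions.

```python
from urllib.parse import urlencode


def snapshot_files(snapshot: str) -> list[str]:
    """File names expected on R2 for a given snapshot tag, e.g. "202604"."""
    return [
        f"isamples_{snapshot}_collections.parquet",
        f"isamples_{snapshot}_sample_collections.parquet",
    ]


def explorer_url(base_url: str, collection_id: str) -> str:
    """Explorer link that pre-selects one collection via ?collection=."""
    query = urlencode({"collection": collection_id})
    return f"{base_url.rstrip('/')}/explorer.html?{query}"


# Base URL is a stand-in; substitute the site being verified.
print(snapshot_files("202604"))
print(explorer_url("http://localhost:5860", "dd74c71982da0e21"))
```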

Known limitations (v1)

  • Collection facet counts are the collection's static total — not
    cross-filtered against other facets (no cross_filter cache for collections).
    The dots and table do respect the filter. Documented in EXPLORER_STATE.md.
  • Like the other facets, collection filtering applies at neighborhood/point zoom,
    not to zoomed-out H3 clusters (same #facetNote caveat).

🤖 Generated with Claude Code

…samplesorg#243)

Additive 'collection' dimension: filter the explorer to a named SamplingSite
label (e.g. OpenContext 'PKAP Survey Area'). Precomputes site membership via
the wide-parquet Sample->Event->Site traversal into two new R2 files; touches
none of the existing facet files. Rebased onto main so it sits cleanly on top
of the merged isamplesorg#242 heatmap work (disjoint regions, no conflict).

- scripts/build_collections.py: builds collections.parquet +
  sample_collections.parquet. Unnests BOTH relationship arrays
  (multi-event/multi-site safe), counts DISTINCT pids, orders membership by
  collection_id for row-group pruning. PKAP=15,446 verified; both files live on
  data.isamples.org.
- explorer.qmd: dual-UX collection facet (top-N checkboxes + search-the-tail),
  ?collection= URL param wired through the existing facet lifecycle and the
  facetFilterSQL() chokepoint (2nd subquery against sample_collections.parquet).
- collections.qmd: Featured Collections page uses identity-based &collection=.
- EXPLORER_STATE.md, data.qmd: document the new param and files.
- tests/test_collections.py: page + facet-DOM checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rdhyee rdhyee force-pushed the feat/collection-facet branch from c3a34e5 to 219b400 Compare June 30, 2026 21:39

rdhyee commented Jun 30, 2026

Contributor Author

Rebased onto upstream/main (was 41 commits behind) to resolve conflicts with the #305 facet-count stack. All conflicts were in explorer.qmd, in the four spots where this branch's collection-facet wiring sat next to the #305 facet-tree/mask/index machinery — composed both sides rather than picking one:

No EXPLORER_STATE.md/_quarto.yml/data.qmd conflicts — those merged cleanly.

Tests: npm run test:unit — 13/13 pass. tests/test_collections.py needs a live rendered site (not the prod fallback, since collections.qmd/explorer.qmd changes aren't deployed there yet) — ran it against a local quarto preview server (ISAMPLES_BASE_URL=http://localhost:5860) and all 7 tests pass.

Still draft — leaving as-is since the R2 upload of isamples_202604_collections.parquet/isamples_202604_sample_collections.parquet hasn't happened yet per the PR description's merge gate.

Development

Successfully merging this pull request may close these issues.

Add a 'collection' dimension to the explorer (e.g. OpenContext PKAP) — precompute site membership, then facet
