From 5ec8b33eff9df1b71b7e2ee1538ddc9cc77fd9a5 Mon Sep 17 00:00:00 2001 From: Ricardo Devis Agullo Date: Mon, 22 Jun 2026 20:15:03 +0200 Subject: [PATCH] Document registry metadata stores feature and migration --- .../docs/registry/registry-configuration.md | 1 + .../docs/registry/registry-metadata-stores.md | 368 ++++++++++++++++++ website/sidebars.js | 1 + website/static/llms.txt | 1 + 4 files changed, 371 insertions(+) create mode 100644 website/docs/registry/registry-metadata-stores.md diff --git a/website/docs/registry/registry-configuration.md b/website/docs/registry/registry-configuration.md index a434ccc..4f6641f 100644 --- a/website/docs/registry/registry-configuration.md +++ b/website/docs/registry/registry-configuration.md @@ -121,6 +121,7 @@ For unsuscribing to all [events](#registry-events). | storage | object | no | - | Low-level storage adapter configuration. Use this for non-S3 storage adapters | | storage.adapter | function | yes\* | - | Storage adapter function that returns a StorageAdapter instance (\*required if using storage) | | storage.options | object | yes\* | - | Options passed to the storage adapter, must include `componentsDir` (\*required if using storage) | +| metadata | object | no | - | Opt-in [metadata store](/docs/registry/registry-metadata-stores) configuration. When present, the registry uses a pluggable database as the source of truth for the component index instead of scanning storage. Absent → storage-only mode (default) | | refreshInterval | number (seconds) | no | - | Seconds between each refresh of the internal component list cache. Given the data is immutable, this should be high and just for robustness | | pollingInterval | number (seconds) | no | `5` | Seconds between each poll of the storage adapter for changes. Should be low (5-10 seconds) for efficient distribution | | tempDir | string | no | `"./temp/"` | Directory where components' packages are temporarily stored during publishing | diff --git a/website/docs/registry/registry-metadata-stores.md b/website/docs/registry/registry-metadata-stores.md new file mode 100644 index 0000000..c481962 --- /dev/null +++ b/website/docs/registry/registry-metadata-stores.md @@ -0,0 +1,368 @@ +--- +sidebar_position: 3 +--- + +# Registry Metadata Stores + +## Overview + +By default, an OpenComponents registry keeps its **metadata index** — the list of +which components and versions exist — in two storage files: +`components.json` and `components-details.json`. This index is *derived*: the +registry rebuilds it by scanning the whole storage directory tree +(`componentsDir///package.json`) on startup and again after +every publish. + +That scan is `O(registry size)`. It also isn't transactional: concurrent +publishes on different nodes each scan and then overwrite `components.json` +(last-writer-wins), which only "works" because the next scan re-derives the truth +from the immutable directory tree. Under heavy, multi-node publishing this drives +CPU and GC pressure and makes publish cost grow with the size of the registry. + +A **metadata store** is an opt-in feature that moves this index into a pluggable +database that becomes the source of truth for *what exists*. With it: + +- **Publish** becomes an `O(1)` atomic row append instead of a full scan + blob + rewrite. +- **Startup** becomes a single `O(registry size)` query instead of a directory + walk (and steady-state hydration is just one query). +- **Cross-node correctness** is enforced by a `PRIMARY KEY (component_name, + version)` constraint instead of relying on self-healing rescans. Concurrent + publishes of the same version → one wins, the other gets a clear + "already exists" error; different components never contend. + +### What it changes — and what it doesn't + +The metadata store **only** owns the index of which `name@version` pairs exist. +It does **not** change: + +- **Static files** — component bundles still live in the configured storage + adapter (S3, GS, Azure Blob, …). Storage is still required. +- **`package.json` files** — still fetched from storage by `getComponentInfo`. +- **The hot read path** — component reads are still served from OC's in-memory + cache and never hit the database. + +The metadata store contains only four fields per version: component name, +version, publish date, and template size — exactly what the in-memory caches +need. + +> Metadata mode is **opt-in and non-breaking**. With no `metadata` block in the +> configuration, the registry behaves byte-for-byte as before (storage-only +> mode). + +## How it works + +The store is a pluggable adapter, injected the same way as the `storage.adapter`. +The registry core takes zero database dependencies. The contract each adapter +implements is exported by the `oc-metadata-adapters-utils` package: + +```ts +type ComponentRow = { + name: string; + version: string; + publishDate: number; // unix seconds + templateSize?: number; +}; + +interface MetadataStore { + adapterType: string; + isValid(): boolean; // synchronous config sanity check + initialise(): Promise; // open pool; ensure/verify schema + getAllComponents(): Promise; // hydration — feeds the caches + addVersion(row: ComponentRow): Promise; // commit point + close?(): Promise; // optional: release pool on shutdown +} +``` + +Runtime behavior in metadata mode: + +- **Startup** — the registry initialises the store, then hydrates its in-memory + list and details caches from a single `getAllComponents()` call. If the store + cannot be initialised or queried, startup fails (fail-closed). +- **Reads** — served entirely from the in-memory cache; hot component reads never + touch the database. +- **Polling** — the cache is re-hydrated from `getAllComponents()` on the polling + interval. If a poll fails, the registry keeps serving the last good in-memory + cache and retries on the next interval. +- **Publish** uses a reservation state machine so a failed/concurrent publish can + never clobber another: + 1. validate the publish + 2. write the `package.json` + 3. **reserve** the metadata row + 4. upload the statics to storage + 5. **commit** the metadata row + A duplicate or in-progress reservation surfaces as the usual + "component version already exists" publish error. If the upload or commit + fails, the reservation is best-effort aborted; any orphaned statics are + harmless unreferenced bytes and a re-publish is idempotent. +- **Shutdown** — `registry.close(callback)` closes the HTTP server and then calls + the adapter's optional `close()` hook so it can release its connection pool. + +## Configuration + +Add a `metadata` block as a sibling to `storage`. Storage remains required. + +```js +const azureSqlMetadataAdapter = require("oc-azure-sql-metadata-adapter").default; +const s3StorageAdapter = require("oc-s3-storage-adapter"); + +registry.configure({ + storage: { + adapter: s3StorageAdapter, + options: { + bucket: process.env.OC_STORAGE_BUCKET, + region: process.env.OC_STORAGE_REGION, + componentsDir: "components", + path: process.env.OC_STORAGE_BASE_URL, + }, + }, + metadata: { + adapter: azureSqlMetadataAdapter, + options: { + connectionString: process.env.OC_METADATA_SQL_CONNECTION_STRING, + }, + manageSchema: true, + reconcileFromStorage: false, + exportLegacyFiles: false, + }, +}); +``` + +### `metadata` options + +| Parameter | Type | Mandatory | Default | Description | +| ------------------------------------ | ---------------- | --------- | ------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `metadata` | object | no | - | Presence enables metadata mode. Absent → storage-only mode (default). | +| `metadata.adapter` | function | yes\* | - | Metadata adapter factory returning a `MetadataStore` (\*required if using metadata). | +| `metadata.options` | object | yes\* | - | Connection / pool options passed to the adapter. | +| `metadata.manageSchema` | boolean | no | `true` | When `true`, the adapter auto-creates its table/index if missing. Set `false` for locked-down databases where operators manage DDL; the adapter then only verifies the schema. | +| `metadata.reconcileFromStorage` | boolean | no | `false` | Bake-in flag. On startup, scan storage and idempotently insert any `name@version` present in the directory tree but missing from the database (existing rows skipped). | +| `metadata.exportLegacyFiles` | boolean | no | `false` | Bake-in / DR flag. On startup, write database-derived `components.json` and `components-details.json` projections back to storage. | +| `metadata.exportLegacyFilesInterval` | number (seconds) | no | - | When set (and `exportLegacyFiles` is `true`), also refresh those projections on a non-overlapping background timer. Omit to export on startup only. The timer is cleared on shutdown. | + +The bake-in flags (`reconcileFromStorage`, `exportLegacyFiles`, +`exportLegacyFilesInterval`) are only needed while migrating; see +[Migrating an existing registry](#migrating-an-existing-registry). + +> The legacy file export is **decoupled from the publish path** — a publish never +> triggers a full-registry export, so publishing stays an `O(1)` append. The +> exported files are a **one-directional projection** of the database (DB → files, +> never read back to mutate the DB). They exist for rollback and as a +> cold-start / DR snapshot; they do **not** replace the storage adapter, which +> still holds all component statics. + +## Available adapters + +### Azure SQL / SQL Server — `oc-azure-sql-metadata-adapter` + +A connection-pool-based (`mssql`) adapter. + +```sh +npm install oc-azure-sql-metadata-adapter +``` + +```js +metadata: { + adapter: require("oc-azure-sql-metadata-adapter").default, + options: { + connectionString: process.env.OC_METADATA_SQL_CONNECTION_STRING, + }, +} +``` + +You can also pass object connection settings instead of a connection string +(`server`, `database`, `user`, `password`, nested `options`, etc. — passed +through to `mssql`). When no `connectionString`, `password`, or explicit +`authentication` is provided, the adapter defaults to Microsoft Entra ID +(`azure-active-directory-default`), so it can connect using an ambient managed +identity with **no secret in config** (pass `clientId` for a user-assigned +identity). + +Adapter-specific options: `manageSchema` (default `true`), `schemaName` (default +`dbo`), and `tableName` (default `oc_components`). Identifiers must match +`/^[A-Za-z_][A-Za-z0-9_]*$/`. + +With `manageSchema: true` the adapter creates (roughly): + +```sql +CREATE TABLE [dbo].[oc_components] ( + component_name NVARCHAR(255) NOT NULL, + version NVARCHAR(64) NOT NULL, + publish_date BIGINT NOT NULL, + template_size BIGINT NULL, + status NVARCHAR(16) NOT NULL DEFAULT N'committed', + publish_token NVARCHAR(64) NULL, + created_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(), + updated_at DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(), + PRIMARY KEY (component_name, version) +); +CREATE INDEX ix_oc_components_name ON [dbo].[oc_components] (component_name); +``` + +The primary key is the concurrency guard: same-version unique violations +(`2627` / `2601`) are mapped to the shared duplicate / in-progress error codes +before any storage upload happens. + +### Azure Table Storage — `oc-azure-table-metadata-adapter` + +A schemaless, HTTP-based adapter (`@azure/data-tables`). If you already use Azure +Blob Storage for statics, you can reuse the **same storage account** for the +metadata table — no second database to provision. Its `PartitionKey + RowKey` +uniqueness is exactly the concurrency model the metadata store needs. + +```sh +npm install oc-azure-table-metadata-adapter +``` + +```js +metadata: { + adapter: require("oc-azure-table-metadata-adapter").default, + options: { + connectionString: process.env.OC_METADATA_TABLE_CONNECTION_STRING, + }, +} +``` + +Authentication precedence when `connectionString` is absent: +`accountName` + `accountKey` → `sasToken` → explicit `credential` → +`DefaultAzureCredential` (managed identity / workload identity / `az login`), so +the registry can run with no secret. Other options include `endpoint`, +`tableName` (default `occomponents`), `manageSchema` (default `true`), and +`allowInsecureConnection` (for Azurite / local development). The adapter maps a +`409 Conflict` to the shared `VERSION_ALREADY_EXISTS` code. + +### Writing a custom adapter + +Implement the contract from `oc-metadata-adapters-utils`. Each adapter maps its +driver's unique-violation to the shared error codes (e.g. SQL Server +`2627`/`2601`, Postgres `23505`, MySQL `1062`): + +```ts +import type { ComponentRow, MetadataStore } from "oc-metadata-adapters-utils"; +import { + VERSION_ALREADY_EXISTS, + VERSION_PUBLISH_IN_PROGRESS, +} from "oc-metadata-adapters-utils"; +``` + +## Migrating an existing registry + +Migration is **gradual and lossless** — storage stays authoritative-enough +throughout the window so you can roll back at any point. + +### 1. Backfill the database + +Before serving traffic in metadata mode, populate the store from your existing +index using the CLI command: + +```sh +oc registry migrate-metadata ./registry.config.js +``` + +The argument is a path to a module exporting the **same config object** you pass +to `registry.configure()`. It must include both `storage` and `metadata` and pass +registry config validation. The module may be CommonJS, an ES module `default` +export, or an **async function** returning the config (useful for resolving +secrets first): + +```js +// registry.config.js — CommonJS +const azureSqlMetadataAdapter = require("oc-azure-sql-metadata-adapter").default; +const s3StorageAdapter = require("oc-s3-storage-adapter"); + +module.exports = { + baseUrl: "http://my-registry.example.com/", + storage: { + adapter: s3StorageAdapter, + options: { + bucket: "my-bucket", + region: "us-east-1", + componentsDir: "components", + path: "https://cdn.example.com/", + }, + }, + metadata: { + adapter: azureSqlMetadataAdapter, + options: { + connectionString: process.env.OC_METADATA_SQL_CONNECTION_STRING, + }, + }, +}; +``` + +The command initialises the adapter and backfills rows from +`${componentsDir}/components-details.json` (which already *is* the +`ComponentRow` set). If that file is missing, it falls back to scanning +`${componentsDir}///package.json`. Existing rows are skipped, +so the command is **idempotent** and safe to re-run across nodes. It logs +`{ scanned, inserted, skipped }` and closes the adapter pool on exit (even on +failure). + +### 2. Cut over + +Deploy with the `metadata` block configured. During the migration window enable +the bake-in flags: + +```js +metadata: { + adapter: azureSqlMetadataAdapter, + options: { /* ... */ }, + reconcileFromStorage: true, // heal anything published by still-storage-mode nodes + exportLegacyFiles: true, // keep components.json fresh for rollback / external consumers +} +``` + +Nodes now hydrate from the database. A safe rollout order: + +1. Deploy the metadata config to a non-serving environment. +2. Run `oc registry migrate-metadata ./registry.config.js`. +3. Start **one** registry instance in metadata mode and verify reads. +4. Roll out metadata mode to the remaining instances. + +### 3. Bake-in + +Run mixed / observe. While the bake-in flags are on: + +- **`reconcileFromStorage`** upserts, on each boot, any `name@version` that exists + in the directory tree but is missing from the database — healing anything a + node still running in storage mode published during the cutover. +- **`exportLegacyFiles`** (optionally on `exportLegacyFilesInterval`) keeps + `components.json` / `components-details.json` fresh in storage, so external + consumers keep working and **rollback to storage mode loses at most one export + interval**. + +The directory tree remains authoritative-enough that the reconcile can heal any +miss — the same self-healing principle storage mode relies on, applied +deliberately once at the boundary. + +### 4. Steady state + +Once you're confident, drop the bake-in scaffolding: + +- Set `reconcileFromStorage: false` (the directory scan is now abandoned). +- Optionally keep `exportLegacyFiles: true` permanently as a cheap, one-way DR + snapshot / cold-start fallback so the database is never an absolute single + point of failure for booting. + +`components.json` is now a **non-authoritative projection** of the database. + +### Rolling back + +Because the legacy files are kept fresh during bake-in, rolling back is simply +removing the `metadata` block (or reverting the deploy): the registry returns to +storage mode and reads the projected `components.json`, losing at most one export +interval of updates. + +## Failure model + +- **Startup, DB down** — readiness fails and the node retries with backoff; + already-running nodes keep serving from cache and the load balancer skips the + not-ready node. The registry never starts silently empty. +- **Poll, DB blip** — fully resilient: keep serving the in-memory cache, log, and + retry next interval. The only effect is that new publishes propagate slightly + later. +- **Publish, DB unreachable** — the publish fails with a clear error; any statics + uploaded are harmless orphans and the client can retry (idempotent). There is + no buffering. + +Net: **reads survive any DB blip, and publishes correctly refuse during one.** diff --git a/website/sidebars.js b/website/sidebars.js index 9c01926..820e444 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -66,6 +66,7 @@ const sidebars = { items: [ "components/publishing-to-a-registry", "registry/registry-configuration", + "registry/registry-metadata-stores", "registry/registry-using-google-storage", ], }, diff --git a/website/static/llms.txt b/website/static/llms.txt index 59af647..cd91baf 100644 --- a/website/static/llms.txt +++ b/website/static/llms.txt @@ -38,6 +38,7 @@ The framework consists of four core components: CLI & Development Tools for comp ## Registry & Infrastructure - [Registry Configuration](https://opencomponents.github.io/docs/registry/registry-configuration.md): Production registry setup with S3, authentication, and advanced options +- [Metadata Stores](https://opencomponents.github.io/docs/registry/registry-metadata-stores.md): Opt-in pluggable database as source of truth for the component index, plus migration steps for existing registries - [Google Cloud Storage](https://opencomponents.github.io/docs/registry/registry-using-google-storage.md): Alternative storage configuration - [Client Setup](https://opencomponents.github.io/docs/consumers/client-setup.md): Consumer application configuration