Skip to content

[orchestrator-v2] Fix shell cache persistence for fast startup#3640

Open
mwolson wants to merge 1 commit into
pingdotgg:t3code/codex-turn-mappingfrom
mwolson:fix/shell-cache-persist
Open

[orchestrator-v2] Fix shell cache persistence for fast startup#3640
mwolson wants to merge 1 commit into
pingdotgg:t3code/codex-turn-mappingfrom
mwolson:fix/shell-cache-persist

Conversation

@mwolson

@mwolson mwolson commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Ignore Grok ACP JSON-RPC responses with non-numeric ids before they reach the Effect RPC client. This keeps agent-internal messages such as skills-reload from breaking provider startup.
  • Persist the latest live environment shell snapshot when the client disconnects or the shell state scope closes.
  • Store orchestration cache envelopes with v2 JSON DateTime schemas so shell and thread cache rows round-trip after restart.
  • Let startup hydrate projects and thread shells from IndexedDB while live subscribeShell synchronization is still starting.

Startup Impact Observed

Local AppImage validation showed three separate startup-cache blockers:

  • Grok can emit ACP JSON-RPC responses with agent-internal string ids. Before this fix, those responses reached the Effect RPC client and could fail numeric id decoding during provider startup.
  • Before the shell flush fix, the active environment had thread cache rows but no shell row, so project and thread shell rendering had to wait for live synchronization.
  • After the flush path was instrumented, the AppImage wrote a shell row for environment 61b9d3bf-... with 17 projects, 5 active threads, 1 archived thread, and snapshot sequence 2195612.
  • A later restart read that row but discarded it because latestVisibleMessage.updatedAt was stored as an ISO string while the cache envelope decoded it as DateTime.Utc. This branch now fixes that JSON boundary and adds a regression test that round-trips the cached JSON shape.

With these fixes in place, a local AppImage restart on July 2, 2026 showed fast startup spans: desktop.startup completed in about 327ms, server.startup in about 350ms, orchestration v2 projection verification in about 238ms, and orchestration v2 recovery in about 82ms.

Problem and Fix

Problem and Why it Happened Fix
Grok can emit ACP JSON-RPC responses for agent-internal messages such as skills-reload with non-numeric ids. Those responses are not valid Effect RPC responses and could fail BigInt id decoding during provider startup. Drop non-numeric response ids in the ACP transport wrapper before forwarding messages into the Effect RPC client.
Shell snapshots were persisted through a 500ms debounced queue. A quit or disconnect before the debounce fired could leave IndexedDB without a shell row for the current environment, even when thread cache rows existed. Flush the latest live shell snapshot when the environment supervisor disconnects, before the state falls back to cached.
Thread cache state already persisted on scope close, but shell state did not. Scope shutdown could drop the latest live shell snapshot. Add a shell state finalizer that persists the latest live snapshot on scope close.
A live snapshot can remain available while transient shell state moves back to synchronizing. A disconnect in that window could still skip the flush. Track the latest live shell snapshot separately from display status and flush that snapshot on disconnect or scope close.
The UI state could publish a newer live snapshot just before shutdown while the flush source still pointed at the previous snapshot. Update the latest live snapshot ref before publishing the live state.
An older async IndexedDB write can overlap with a newer live shell update. If the older write finishes last, IndexedDB can be left behind. After each shell cache save, re-check the latest live snapshot sequence and immediately write the newer snapshot before returning.
Persisted cache rows cross a JSON storage boundary, but the shell and thread cache envelopes used runtime v2 DateTime schemas. Rows with nested DateTime fields decoded as corrupt and were discarded on restart. Add v2 JSON schemas for cached shell and thread projection shapes and use them in the cache envelopes.

What This Does Not Change

  • This is a client cache persistence and ACP response-filtering fix. It does not rebuild or repair server projection data, unreadable v2 thread projections, backend startup failures, or SSH environment checks.
  • If no live shell snapshot has been received in the current run, there is still nothing to persist.
  • Live subscribeShell remains authoritative after startup hydration.

Validation

  • vp test packages/effect-acp/src/protocol.test.ts packages/client-runtime/src/state/shell-sync.test.ts packages/client-runtime/src/platform/orchestrationCache.test.ts passed
  • vp check passed
  • vp run typecheck passed
  • Local AppImage restart logs on July 2, 2026 showed desktop.startup at about 327ms and server.startup at about 350ms

Note

Fix shell cache persistence to guarantee latest snapshot is saved on disconnect

  • Adds JSON-mapped schema variants (OrchestrationV2ShellSnapshotJson, OrchestrationV2ThreadProjectionJson, etc.) in orchestrationV2.ts so shell and thread snapshots serialize date fields as ISO strings.
  • Updates StoredOrchestrationShellSnapshot and StoredOrchestrationThreadSnapshot in orchestrationCache.ts to encode/decode against these JSON-mapped schemas.
  • Rewrites persist logic in shell.ts to track the latest live snapshot via a Ref and loop until the newest snapshot is persisted, preventing stale saves when updates arrive faster than the debounce.
  • Flushes the latest live snapshot to persistence when the environment disconnects or the scope finalizes, ensuring no snapshot is lost at shutdown.
  • Drops JSON-RPC responses with non-numeric request IDs in protocol.ts instead of forwarding them to the RPC client queue.

Macroscope summarized daa9283.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9d29eb67-f55d-4c36-a4df-92e7cd8c31b5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions github-actions Bot added vouch:unvouched PR author is not yet trusted in the VOUCHED list. size:S 10-29 changed lines (additions + deletions). labels Jul 2, 2026
Comment thread packages/client-runtime/src/state/shell.ts
@mwolson mwolson changed the title fix(client-runtime): flush shell cache on disconnect [orchestrator-v2] fix(client-runtime): flush shell cache on disconnect Jul 2, 2026
@mwolson mwolson changed the title [orchestrator-v2] fix(client-runtime): flush shell cache on disconnect [orchestrator-v2] fix(client-runtime): flush shell cache on disconnect to retain fast startup Jul 2, 2026
@mwolson mwolson changed the title [orchestrator-v2] fix(client-runtime): flush shell cache on disconnect to retain fast startup fix(client-runtime): Flush shell cache on disconnect to retain fast startup Jul 2, 2026
@mwolson mwolson marked this pull request as ready for review July 2, 2026 01:57

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8866ebd. Configure here.

Comment thread packages/client-runtime/src/state/shell.ts
@macroscopeapp

macroscopeapp Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Approvability

Verdict: Needs human review

2 blocking correctness issues found. This PR modifies cache persistence timing and adds concurrent state handling. Unresolved review comments identify potential blocking/hanging bugs in the finalizer loop and disconnect handler that should be addressed before merging.

You can customize Macroscope's approvability policy. Learn more.

@mwolson mwolson force-pushed the fix/shell-cache-persist branch from 8866ebd to 191408f Compare July 2, 2026 02:03
@mwolson mwolson changed the title fix(client-runtime): Flush shell cache on disconnect to retain fast startup Fix shell cache flush on disconnect to retain fast startup Jul 2, 2026
macroscopeapp[bot]
macroscopeapp Bot previously approved these changes Jul 2, 2026
@mwolson mwolson force-pushed the fix/shell-cache-persist branch from 191408f to d6a0e95 Compare July 2, 2026 02:09
@macroscopeapp macroscopeapp Bot dismissed their stale review July 2, 2026 02:09

Dismissing prior approval to re-evaluate d6a0e95

@github-actions github-actions Bot added size:M 30-99 changed lines (additions + deletions). and removed size:S 10-29 changed lines (additions + deletions). labels Jul 2, 2026
Comment thread packages/client-runtime/src/state/shell.ts
@mwolson mwolson force-pushed the fix/shell-cache-persist branch from d6a0e95 to 275c738 Compare July 2, 2026 02:16
@mwolson mwolson changed the title Fix shell cache flush on disconnect to retain fast startup [orchestrator-v2] Fix shell cache flush on disconnect to retain fast startup Jul 2, 2026
@mwolson mwolson marked this pull request as draft July 2, 2026 02:56
@mwolson mwolson changed the title [orchestrator-v2] Fix shell cache flush on disconnect to retain fast startup [orchestrator-v2] Fix shell cache persistence for fast startup Jul 2, 2026
@mwolson mwolson force-pushed the fix/shell-cache-persist branch from 275c738 to a9a74af Compare July 2, 2026 04:10
@github-actions github-actions Bot added size:L 100-499 changed lines (additions + deletions). and removed size:M 30-99 changed lines (additions + deletions). labels Jul 2, 2026
@mwolson mwolson marked this pull request as ready for review July 2, 2026 04:11
yield* persist(snapshot.value);
});

const setDisconnected = flushLiveShellSnapshot.pipe(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium state/shell.ts:116

setDisconnected now calls flushLiveShellSnapshot — which synchronously runs cache.saveShell — before updating the SubscriptionRef status. Because setDisconnected runs inside Stream.runForEach on supervisor.state, a slow or hanging persistence backend blocks the stream, so the shell state stays stuck at "live" and stops processing later connection transitions until disk I/O completes. Consider making setDisconnected update the status first and flush the snapshot in the background, or fork the flush so connection transitions are not gated on persistence.

🤖 Copy this AI Prompt to have your agent fix this:
In file @packages/client-runtime/src/state/shell.ts around line 116:

`setDisconnected` now calls `flushLiveShellSnapshot` — which synchronously runs `cache.saveShell` — before updating the `SubscriptionRef` status. Because `setDisconnected` runs inside `Stream.runForEach` on `supervisor.state`, a slow or hanging persistence backend blocks the stream, so the shell state stays stuck at `"live"` and stops processing later connection transitions until disk I/O completes. Consider making `setDisconnected` update the status first and flush the snapshot in the background, or fork the flush so connection transitions are not gated on persistence.

@mwolson mwolson marked this pull request as draft July 2, 2026 04:32
@mwolson mwolson force-pushed the fix/shell-cache-persist branch from a9a74af to fcce425 Compare July 2, 2026 04:40
Ignore Grok ACP JSON-RPC responses with non-numeric ids before they reach Effect RpcClient so agent-internal messages such as skills-reload do not break provider startup.

Persist and round-trip the latest live shell snapshot through IndexedDB so project and thread shells can hydrate from cache before live synchronization catches up.
@mwolson mwolson force-pushed the fix/shell-cache-persist branch from fcce425 to daa9283 Compare July 2, 2026 05:02
@mwolson mwolson marked this pull request as ready for review July 2, 2026 05:06
Effect.forkScoped,
);

yield* Effect.addFinalizer(() => flushLiveShellSnapshot);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium state/shell.ts:201

The scope finalizer calls flushLiveShellSnapshot, which runs persist's while (true) loop until latestLiveSnapshot stops advancing. But the finalizer runs before the forked subscription fiber is interrupted, so applyItem can keep setting latestLiveSnapshot (lines 171-177) with newer sequences while cache.saveShell is still in progress. Each iteration sees a newer snapshot, so the loop's exit condition (latestSnapshot.value.snapshotSequence <= nextSnapshot.snapshotSequence) never holds, and the finalizer hangs indefinitely — blocking environment teardown and app shutdown. Consider stopping the subscription stream before flushing, or having flushLiveShellSnapshot read and clear latestLiveSnapshot atomically so concurrent applyItem calls can't keep the loop alive.

🤖 Copy this AI Prompt to have your agent fix this:
In file @packages/client-runtime/src/state/shell.ts around line 201:

The scope finalizer calls `flushLiveShellSnapshot`, which runs `persist`'s `while (true)` loop until `latestLiveSnapshot` stops advancing. But the finalizer runs before the forked subscription fiber is interrupted, so `applyItem` can keep setting `latestLiveSnapshot` (lines 171-177) with newer sequences while `cache.saveShell` is still in progress. Each iteration sees a newer snapshot, so the loop's exit condition (`latestSnapshot.value.snapshotSequence <= nextSnapshot.snapshotSequence`) never holds, and the finalizer hangs indefinitely — blocking environment teardown and app shutdown. Consider stopping the subscription stream before flushing, or having `flushLiveShellSnapshot` read and clear `latestLiveSnapshot` atomically so concurrent `applyItem` calls can't keep the loop alive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L 100-499 changed lines (additions + deletions). vouch:unvouched PR author is not yet trusted in the VOUCHED list.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant