Skip to content

Fix use-after-free in async host ledger reads on shutdown#8003

Draft
Copilot wants to merge 6 commits into
mainfrom
copilot/fix-potential-async-cleanup-problems
Draft

Fix use-after-free in async host ledger reads on shutdown#8003
Copilot wants to merge 6 commits into
mainfrom
copilot/fix-potential-async-cleanup-problems

Conversation

Copilot AI commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Async uv_work_t ledger reads in src/host/ledger.h could have their worker and completion callbacks run after the host Ledger was destroyed during shutdown (uv_stop ends the main loop, then the post-loop flushing iterations run after Ledger has left scope in run_main_loop). The worker dereferenced data->ledger, a use-after-free. Per @eddyashton's analysis, the other flagged sites (timer.h, process_launcher.h, Signal, HTTPParser) are already safe via libuv proxy/close pointers or run-scoped lifetimes.

Changes (all in src/host/ledger.h)

  • AsyncReadState control block - shared_ptr-owned by Ledger and copied into every outstanding AsyncLedgerGet. Holds a mutex, condition variable, active_workers count, and ledger_alive flag. It outlives the Ledger, so worker callbacks can safely decide whether they may still access it during shutdown.
  • on_ledger_get_async - registers as an active worker under the lock; if ledger_alive is already false it skips the read entirely (logged at DEBUG). The raw Ledger* is only dereferenced after observing ledger_alive == true and incrementing the count, which forces the destructor to wait. An RAII guard decrements active_workers and notifies waiters on every exit path from the worker.
  • ~Ledger() - sets ledger_alive = false and blocks until active_workers == 0, draining in-flight reads before member destruction.
  • The completion callback remains Ledger-independent, writing back through a captured to_enclave shared_ptr.

The lock ordering is safe: a worker either registers before the destructor checks (destructor waits, Ledger stays alive through the read) or observes ledger_alive == false and never touches the Ledger. The AsyncReadState stays alive via the worker's own shared_ptr, so member-destruction order is irrelevant.

Scope

Shutdown-only; running nodes are unaffected. A CHANGELOG.md entry is included.

Copilot AI and others added 2 commits June 29, 2026 17:32
Closes #3501

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Closes #3501

Co-authored-by: achamayou <4016369+achamayou@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix potential async cleanup problems Fix use-after-free in async host ledger reads on shutdown Jun 29, 2026
Copilot AI requested a review from achamayou June 29, 2026 17:35
@achamayou achamayou requested a review from Copilot June 29, 2026 20:26

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a shutdown-only use-after-free risk in the host’s async ledger read path (uv_work_t threadpool jobs) by ensuring worker callbacks never dereference a Ledger* after the owning asynchost::Ledger has been destroyed. It does this by introducing a shared control block that outlives the Ledger and by having ~Ledger() wait for in-flight worker callbacks to finish before allowing member destruction.

Changes:

  • Added a shared_ptr-owned AsyncReadState (mutex + condition variable + active_workers + ledger_alive) shared between Ledger and each AsyncLedgerGet.
  • Updated on_ledger_get_async to (a) skip reads once shutdown begins and (b) register/deregister active worker callbacks so ~Ledger() can safely drain them.
  • Added a CHANGELOG entry documenting the shutdown use-after-free fix.

Custom instructions used:

  • .github/copilot-instructions.md
  • .github/instructions/reviewing.instructions.md

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
src/host/ledger.h Introduces AsyncReadState and shutdown coordination so async libuv workers don’t access a destroyed Ledger during/after shutdown.
CHANGELOG.md Documents the shutdown UAF fix with a PR reference.

Copy link
Copy Markdown
Member

No blocking issues found.

I reviewed PR #8003 against main, focusing on src/host/ledger.h async uv_work_t reads, the run_main_loop shutdown order, the linked #3501 concern, minimality, and the repo review instructions.

Correctness:

  • The bug is real: uv_stop in handle_ring_buffer.h exits uv_run without draining threadpool work; Ledger is stack-local in run_main_loop and is destroyed before the post-loop uv_loop_close drain. A pending or running on_ledger_get_async could previously dereference data->ledger after destruction.
  • The AsyncReadState protocol is sound: ledger_alive and active_workers are checked/updated under one mutex, so a worker either registers before the destructor can finish or observes shutdown and never touches Ledger. condition_variable::wait releases the mutex, and the decrement happens in the worker callback, so this does not depend on the stopped main loop running.
  • Completion callbacks remain Ledger-independent: result_cb only uses the captured WriterPtr and request data, and AsyncLedgerGet/uv_work_t are still deleted in completion. The captured writer chain/circuit outlives the loop-close drain.
  • I did not find lock-order inversion with state_lock/read_cache_lock; async_state->lock is not held during read_entries_range.
  • The change is local and minimal for the problem. uv_cancel alone would not cover running work, and moving the loop-close drain earlier would have broader shutdown implications.
  • The CHANGELOG.md entry is appropriate under the current Fixed section.

Non-blocking suggestions:

  • Consider making the active_workers decrement exception-safe with a small scope guard around read_entries_range (src/host/ledger.h:1763-1774). An exception escaping the libuv worker would already be fatal today, so I would not block on this, but the new destructor invariant would be more self-contained if every exit path decremented and notified.
  • The comments around AsyncReadState slightly overstate its role in making completion callbacks safe. It protects the worker's access to Ledger; completion is safe because it no longer touches Ledger and captures to_enclave by value. Tightening that wording would reduce future confusion.
  • I do not think a deterministic test is required here. The failure is a shutdown race around libuv worker scheduling; an ASAN stress test could add confidence but would likely be timing-sensitive.

CI note: At the time of review, the main VM/ACI checks for the current head were still queued or in progress; ASAN/TSAN and long jobs were skipped by workflow configuration.

Custom instructions used:

  • .github/copilot-instructions.md
  • .github/instructions/reviewing.instructions.md

Make the active worker accounting exception-safe and clarify that AsyncReadState protects worker access to Ledger while completion callbacks are Ledger-independent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants