cua-harbor shared core: Kernel Harbor env <-> cua agent (benchmarks/) by rgarcia · Pull Request #40 · kernel/cua

rgarcia · 2026-06-26T14:59:27Z

Summary

Repurposes this PR from the standalone Online-Mind2Web runner to the reusable shared core that connects Harbor's Kernel environment to the cua agent loop. Everything lives in kernel/cua under a new top-level ./benchmarks/; kernel/harbor is not modified — the agent is loaded by import path.

What's here

- benchmarks/src/cua_harbor/: CuaHarborAgent(BaseAgent), loaded via --agent-import-path cua_harbor:CuaHarborAgent (Harbor's AgentFactory.create_agent_from_import_path, so no enum/factory/fork edit). It spawns the Node entrypoint on the host, plumbs the Kernel session id/key + model + provider key, writes /logs/agent/answer.txt, and maps the run log to an ATIF trajectory.
- benchmarks/node/: the cua-bench-task entrypoint. It attaches to the Kernel session via client.browsers.retrieve(KERNEL_SESSION_ID) (never creates/deletes it), runs CuaAgentHarness, and emits the answer + per-step screenshots + an ATIF-mappable run.jsonl. Depends on the published @onkernel/* packages.
- uv Python package; harbor[kernel] pulled as a git dependency from the fork branch that carries the Kernel env (hypeship/kernel-environment).
- A minimal example task (examples/tasks/cua-hello) + Node/Python tests.
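
The host-side plumbing described in the list above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in this PR: `build_entrypoint_env`, `run_entrypoint`, and the `CUA_MODEL` variable name are hypothetical. Only `KERNEL_SESSION_ID` and `KERNEL_API_KEY` are named in the PR, and how the task instruction reaches the entrypoint is left out.

```python
import asyncio
import os
from pathlib import Path


def build_entrypoint_env(
    session_id: str,
    kernel_api_key: str,
    model: str,
    provider_keys: dict[str, str],
) -> dict[str, str]:
    """Return the environment for the Node subprocess (hypothetical helper)."""
    env = dict(os.environ)  # inherit PATH etc. so the program can be found
    env.update(provider_keys)  # e.g. {"ANTHROPIC_API_KEY": "..."}
    # Set the Kernel values last so provider keys cannot override them.
    env["KERNEL_SESSION_ID"] = session_id
    env["KERNEL_API_KEY"] = kernel_api_key
    env["CUA_MODEL"] = model  # assumed variable name, not confirmed by the PR
    return env


async def run_entrypoint(
    program: str, args: list[str], env: dict[str, str], stderr_path: Path
) -> int:
    """Spawn the entrypoint, send stderr to a file, and return the exit code."""
    with stderr_path.open("wb") as stderr:
        proc = await asyncio.create_subprocess_exec(
            program,
            *args,
            env=env,
            stdout=asyncio.subprocess.DEVNULL,
            stderr=stderr,
        )
        await proc.wait()
    return proc.returncode
```

Returning the exit code rather than swallowing it leaves the caller free to decide how a non-zero exit should be reported to Harbor.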

Design

The Kernel environment stays in kernel/harbor (a candidate for upstream contribution), used at runtime via -e kernel. Only the Kernel-specific agent + (later) benchmark adapters live here.
Harbor consumed unmodified; agent loaded by import path.

Verification

benchmarks/node: build + tsc typecheck + vitest green. benchmarks: uv sync (resolves harbor[kernel] from the git fork branch) + ruff + pytest green (Python tests mock the Node subprocess).
Live end-to-end smoke passed: uv run harbor run -p examples/tasks/cua-hello -e kernel --agent-import-path cua_harbor:CuaHarborAgent -m anthropic/claude-opus-4-8 → reward 1.0, 0 exceptions (real Kernel browser, opus-4-8 drives example.com, verifier checks the answer). The live run also exposed a missing harbor[kernel] extra that the mocked tests had not caught; that is fixed here.

Scope / next

Shared core only. The three benchmark adapters (Online-Mind2Web, WebVoyager, ClawBench) land as later branches off this one, each adding benchmarks/adapters/<name>/ (a Python task generator + verifier) on top of this core.

Note

Medium Risk
New benchmark integration that forwards API keys and Kernel session credentials into a host Node subprocess and pins Harbor to a git fork branch; scope is isolated to benchmarks/ but runtime depends on external services and keys.

Overview
Adds a new benchmarks/ package so the cua computer-use agent can run as a Harbor agent on the Kernel browser environment without changing Harbor itself. The agent is loaded via --agent-import-path cua_harbor:CuaHarborAgent.

CuaHarborAgent runs on the host: it reads KERNEL_SESSION_ID / KERNEL_API_KEY from the environment, maps Harbor provider/name to cua provider:name, forwards provider keys from --ae, and spawns the bundled Node task.js. The entrypoint attaches to the existing Kernel session (browsers.retrieve), runs CuaAgentHarness, and writes /logs/agent artifacts: answer.txt, shots/, and run.jsonl. Python then maps run.jsonl to ATIF trajectory.json and fills AgentContext token/cost fields.
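
The provider/name to provider:name mapping mentioned above can be illustrated with a short helper. This is a sketch only: `to_cua_model_ref` is a hypothetical name, and splitting on the first slash is an assumption about how names containing further slashes should be handled.

```python
def to_cua_model_ref(harbor_model: str) -> str:
    """Convert Harbor's "provider/name" form into cua's "provider:name" form."""
    # Split on the first "/" only, so any later slashes stay in the model name.
    provider, sep, name = harbor_model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/name', got {harbor_model!r}")
    return f"{provider}:{name}"
```

With the model from the smoke run, `to_cua_model_ref("anthropic/claude-opus-4-8")` returns `"anthropic:claude-opus-4-8"`.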

Also includes the cua-hello smoke task (example.com heading), docs, harbor[kernel] as a git dependency on the Kernel-env branch, wheel packaging of built node/dist, and Node/Python tests (subprocess mocked in pytest).
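
For a smoke task like cua-hello, the verifier's check on the answer could be as small as the sketch below. It is an illustration under assumptions, not the task's actual verifier: `grade_answer` is a hypothetical name, the expected string "Example Domain" is the heading served by example.com, and the case-insensitive substring match is an assumed grading rule.

```python
from pathlib import Path


def grade_answer(answer_path: Path, expected: str = "Example Domain") -> float:
    """Return 1.0 if the agent's answer mentions the expected heading, else 0.0."""
    if not answer_path.exists():
        return 0.0  # no answer file means the agent produced nothing to grade
    answer = answer_path.read_text(encoding="utf-8").strip()
    return 1.0 if expected.lower() in answer.lower() else 0.0
```

An empty answer.txt scores 0.0 under this rule.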

^{Reviewed by Cursor Bugbot for commit 455ac06. Bugbot is set up for automated code reviews on this repo. Configure here.}

Introduces @onkernel/cua-bench, an extensible web-agent benchmark runner that drives the cua-agent loop against Kernel cloud browsers and grades trajectories with a configurable LLM judge. Online-Mind2Web is the first benchmark, graded by a ported WebJudge; a registry interface lets more benchmarks drop in. Wires the package into the npm workspace, build chain, and tsconfig project references.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the standalone packages/bench with ./benchmarks: the reusable core connecting Harbor's Kernel environment to the cua agent loop.

- benchmarks/src/cua_harbor: CuaHarborAgent (BaseAgent), loaded by Harbor via --agent-import-path; no harbor fork changes needed (import-path resolution).
- benchmarks/node: the cua-bench-task entrypoint that attaches to the Kernel session via browsers.retrieve, runs CuaAgentHarness, and emits the answer, screenshots, and an ATIF-mappable event log under /logs/agent.
- harbor[kernel] pulled as a uv git dependency; minimal example task + tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

✅ Fixed: First prompt lacks screenshot
- The Node benchmark entrypoint now captures an initial browser screenshot and passes it via harness.prompt(..., { images }) on the first turn.
✅ Fixed: Node failure treated as success
- CuaHarborAgent.run now raises a RuntimeError when the Node subprocess exits non-zero so failed runs are surfaced as failures instead of continuing.
✅ Fixed: Ignores harness error stopReason
- The Node benchmark entrypoint now checks assistant.stopReason after harness.prompt and throws on error or aborted to fail the run.

Or push these changes by commenting:

@cursor push 90be33c080

Preview (90be33c080)

diff --git a/benchmarks/node/src/task.ts b/benchmarks/node/src/task.ts
--- a/benchmarks/node/src/task.ts
+++ b/benchmarks/node/src/task.ts
@@ -1,7 +1,7 @@
 import { writeFileSync } from "node:fs";
 import { join } from "node:path";
 import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent";
-import { type CuaModelRef, requireCuaEnvApiKeyForModel } from "@onkernel/cua-ai";
+import { type CuaModelRef, type ImageContent, requireCuaEnvApiKeyForModel } from "@onkernel/cua-ai";
 import Kernel from "@onkernel/sdk";
 import { extractFinalAnswer } from "./answer.ts";
 import { attachAtifSink, writeFinalLine, writeUserLine } from "./sink.ts";
@@ -34,7 +34,8 @@
   requireCuaEnvApiKeyForModel(model);
 
   const client = new Kernel({ apiKey: requireEnv("KERNEL_API_KEY") });
-  const browser = await client.browsers.retrieve(requireEnv("KERNEL_SESSION_ID"));
+  const kernelSessionId = requireEnv("KERNEL_SESSION_ID");
+  const browser = await client.browsers.retrieve(kernelSessionId);
   const session = await new InMemorySessionRepo().create({ id: taskId });
   const harness = new CuaAgentHarness({
     browser,
@@ -48,7 +49,11 @@
   writeUserLine(outDir, instruction);
   const unsubscribe = attachAtifSink({ harness, outDir });
   try {
-    await harness.prompt(instruction);
+    const images = await captureInitialScreenshot(client, kernelSessionId);
+    const assistant = await harness.prompt(instruction, images ? { images } : undefined);
+    if (assistant.stopReason === "error" || assistant.stopReason === "aborted") {
+      throw new Error(assistant.errorMessage ?? `agent stopped with ${assistant.stopReason}`);
+    }
   } finally {
     unsubscribe();
   }
@@ -59,6 +64,16 @@
   writeFinalLine(outDir, { answer, session_id: taskId, model, agent_version: CUA_AGENT_VERSION });
 }
 
+async function captureInitialScreenshot(client: Kernel, sessionId: string): Promise<ImageContent[] | undefined> {
+  try {
+    const screenshot = await client.browsers.computer.captureScreenshot(sessionId);
+    const image = Buffer.from(await screenshot.arrayBuffer()).toString("base64");
+    return [{ type: "image", data: image, mimeType: "image/png" }];
+  } catch {
+    return undefined;
+  }
+}
+
 main().catch((err) => {
   console.error(err instanceof Error ? err.stack ?? err.message : String(err));
   process.exit(1);

diff --git a/benchmarks/src/cua_harbor/agent.py b/benchmarks/src/cua_harbor/agent.py
--- a/benchmarks/src/cua_harbor/agent.py
+++ b/benchmarks/src/cua_harbor/agent.py
@@ -105,6 +105,9 @@
             self.logger.warning(
                 f"cua entrypoint exited with code {proc.returncode}; see {stderr_path}"
             )
+            raise RuntimeError(
+                f"cua entrypoint exited with code {proc.returncode}; see {stderr_path}"
+            )
 
         self._ensure_answer_file()
         # Populate context now so a later timeout still leaves token/cost metrics.

diff --git a/benchmarks/tests/test_agent.py b/benchmarks/tests/test_agent.py
--- a/benchmarks/tests/test_agent.py
+++ b/benchmarks/tests/test_agent.py
@@ -100,3 +100,21 @@
     agent = _make_agent(tmp_path)
     with pytest.raises(RuntimeError, match="Kernel session not started"):
         await agent.run("hi", EmptyEnv(), AgentContext())
+
+
+async def test_run_raises_when_node_entrypoint_fails(tmp_path, fake_env, monkeypatch):
+    class _FailedProc:
+        returncode = 17
+
+        async def wait(self) -> None:
+            return None
+
+    async def fake_exec(program, *args, env=None, stdout=None, stderr=None):
+        return _FailedProc()
+
+    monkeypatch.setattr("cua_harbor.agent.asyncio.create_subprocess_exec", fake_exec)
+    monkeypatch.setattr(CuaHarborAgent, "_bundle_path", lambda self: Path("/fake/task.js"))
+
+    agent = _make_agent(tmp_path)
+    with pytest.raises(RuntimeError, match="cua entrypoint exited with code 17"):
+        await agent.run("Go to example.com", fake_env, AgentContext())


cursor · 2026-06-27T13:40:10Z

+  writeUserLine(outDir, instruction);
+  const unsubscribe = attachAtifSink({ harness, outDir });
+  try {
+    await harness.prompt(instruction);


First prompt lacks screenshot

Medium Severity

The Node entrypoint starts a fresh harness session with harness.prompt(instruction) and never passes { images }. For non-yutori providers the first turn is sent without the current browser frame, unlike the CLI’s maybeInitialScreenshot pattern, so models that rely on an initial screenshot (or observe-style prompts) can fail or waste a tool round even when Kernel already navigated to the start URL.

^{Triggered by learned rule: Harness prompt calls must attach first-prompt screenshot for non-yutori providers}


cursor · 2026-06-27T13:40:10Z

+
+        self._ensure_answer_file()
+        # Populate context now so a later timeout still leaves token/cost metrics.
+        self.populate_context_post_run(context)


Node failure treated as success

Medium Severity

If the Node entrypoint exits non-zero, CuaHarborAgent.run logs a warning but continues execution without raising an exception. This allows _ensure_answer_file and populate_context_post_run to run, potentially leaving answer.txt empty or partial. This behavior can mask agent failures from Harbor, leading to misleading task outcomes and metrics.


cursor · 2026-06-27T13:40:10Z

+  const branch = await session.getBranch();
+  const answer = extractFinalAnswer(branch);
+  writeFileSync(join(outDir, "answer.txt"), answer);
+  writeFinalLine(outDir, { answer, session_id: taskId, model, agent_version: CUA_AGENT_VERSION });


Ignores harness error stopReason

Medium Severity

After harness.prompt, the entrypoint never inspects the returned assistant message for stopReason of error or aborted. A failed agent turn can still write answer.txt, emit final in run.jsonl, and exit 0, so Harbor treats the agent phase as successful while grading may use an empty or misleading answer.


rgarcia and others added 2 commits June 26, 2026 14:58

rgarcia changed the title ~~Add packages/bench with an Online-Mind2Web benchmark runner~~ cua-harbor shared core: Kernel Harbor env <-> cua agent (benchmarks/) Jun 26, 2026

This was referenced Jun 27, 2026

WebVoyager Harbor adapter (Kernel env + cua agent) #42

Draft

Online-Mind2Web adapter (Kernel env + cua agent) #43

Draft

ClawBench adapter (Kernel env + cua agent) #44

Draft

rgarcia marked this pull request as ready for review June 27, 2026 13:36

cursor Bot reviewed Jun 27, 2026

rgarcia wants to merge 2 commits into main from hypeship/cua-bench-online-mind2web
