cua-harbor shared core: Kernel Harbor env <-> cua agent (benchmarks/)#40

Open
rgarcia wants to merge 2 commits into main from hypeship/cua-bench-online-mind2web

@rgarcia rgarcia commented Jun 26, 2026

Contributor

Summary

Repurposes this PR from the standalone Online-Mind2Web runner to the reusable shared core that connects Harbor's Kernel environment to the cua agent loop. Everything lives in kernel/cua under a new top-level ./benchmarks/; kernel/harbor is not modified — the agent is loaded by import path.

What's here

  • benchmarks/src/cua_harbor/: CuaHarborAgent(BaseAgent), loaded via --agent-import-path cua_harbor:CuaHarborAgent (Harbor's AgentFactory.create_agent_from_import_path, so no enum/factory/fork edit). It spawns the Node entrypoint on the host, plumbs the Kernel session id/key + model + provider key, writes /logs/agent/answer.txt, and maps the run log to an ATIF trajectory.
  • benchmarks/node/ — the cua-bench-task entrypoint: attaches to the Kernel session via client.browsers.retrieve(KERNEL_SESSION_ID) (never creates/deletes it), runs CuaAgentHarness, and emits the answer + per-step screenshots + an ATIF-mappable run.jsonl. Depends on the published @onkernel/* packages.
  • uv Python package; harbor[kernel] pulled as a git dependency from the fork branch that carries the Kernel env (hypeship/kernel-environment).
  • A minimal example task (examples/tasks/cua-hello) + Node/Python tests.
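The environment plumbing described in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in agent.py: build_node_env is a hypothetical helper and CUA_MODEL is an assumed variable name, while KERNEL_SESSION_ID, KERNEL_API_KEY, and the "Kernel session not started" error text are taken from this PR.

```python
from typing import Mapping


def build_node_env(
    base_env: Mapping[str, str],
    model: str,
    provider_keys: Mapping[str, str],
) -> dict[str, str]:
    """Build the environment handed to the Node entrypoint (hypothetical helper)."""
    # The Kernel environment is expected to have started the session already;
    # the agent only forwards its id and API key, it never creates a session.
    missing = [k for k in ("KERNEL_SESSION_ID", "KERNEL_API_KEY") if not base_env.get(k)]
    if missing:
        raise RuntimeError(f"Kernel session not started; missing {', '.join(missing)}")

    env = dict(base_env)
    env["CUA_MODEL"] = model  # assumed variable name, not confirmed by the PR
    env.update(provider_keys)  # e.g. provider API keys forwarded from --ae
    return env
```

The resulting mapping would then be passed as env= when spawning the bundled task.js with asyncio.create_subprocess_exec.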

Design

  • The Kernel environment stays in kernel/harbor (a candidate for upstream contribution), used at runtime via -e kernel. Only the Kernel-specific agent + (later) benchmark adapters live here.
  • Harbor consumed unmodified; agent loaded by import path.

Verification

  • benchmarks/node: build + tsc typecheck + vitest green. benchmarks: uv sync (resolves harbor[kernel] from the git fork branch) + ruff + pytest green (Python tests mock the Node subprocess).
  • Live end-to-end smoke passed: uv run harbor run -p examples/tasks/cua-hello -e kernel --agent-import-path cua_harbor:CuaHarborAgent -m anthropic/claude-opus-4-8 returned reward 1.0 with 0 exceptions (real Kernel browser, opus-4-8 drives example.com, verifier checks the answer). The live run caught a missing harbor[kernel] extra that the mocked tests didn't; that is now fixed.

Scope / next

Shared core only. The three benchmark adapters (Online-Mind2Web, WebVoyager, ClawBench) land as later branches off this one, each adding benchmarks/adapters/<name>/ (a Python task generator + verifier) on top of this core.


Note

Medium Risk
New benchmark integration that forwards API keys and Kernel session credentials into a host Node subprocess and pins Harbor to a git fork branch; scope is isolated to benchmarks/ but runtime depends on external services and keys.

Overview
Adds a new benchmarks/ package so the cua computer-use agent can run as a Harbor agent on the Kernel browser environment, without changing Harbor itself—load via --agent-import-path cua_harbor:CuaHarborAgent.

CuaHarborAgent runs on the host: it reads KERNEL_SESSION_ID / KERNEL_API_KEY from the environment, maps Harbor provider/name to cua provider:name, forwards provider keys from --ae, and spawns the bundled Node task.js. The entrypoint attaches to the existing Kernel session (browsers.retrieve), runs CuaAgentHarness, and writes /logs/agent artifacts: answer.txt, shots/, and run.jsonl. Python then maps run.jsonl to ATIF trajectory.json and fills AgentContext token/cost fields.
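As a rough illustration of the model-name mapping mentioned above (Harbor provider/name to cua provider:name), a minimal sketch might look like this. The function name to_cua_model and its validation rules are assumptions for illustration; the actual mapping in agent.py may handle more cases.

```python
def to_cua_model(harbor_model: str) -> str:
    """Convert a Harbor-style 'provider/name' into a cua-style 'provider:name'."""
    # Split on the first slash only, so any later slashes stay in the model name.
    provider, sep, name = harbor_model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/name', got {harbor_model!r}")
    return f"{provider}:{name}"
```

For example, to_cua_model("anthropic/claude-opus-4-8") returns "anthropic:claude-opus-4-8".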

Also includes the cua-hello smoke task (example.com heading), docs, harbor[kernel] as a git dependency on the Kernel-env branch, wheel packaging of built node/dist, and Node/Python tests (subprocess mocked in pytest).

Reviewed by Cursor Bugbot for commit 455ac06. Bugbot is set up for automated code reviews on this repo.

rgarcia and others added 2 commits June 26, 2026 14:58
Introduces @onkernel/cua-bench, an extensible web-agent benchmark runner that
drives the cua-agent loop against Kernel cloud browsers and grades trajectories
with a configurable LLM judge. Online-Mind2Web is the first benchmark, graded by
a ported WebJudge; a registry interface lets more benchmarks drop in. Wires the
package into the npm workspace, build chain, and tsconfig project references.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace the standalone packages/bench with ./benchmarks: the reusable core
connecting Harbor's Kernel environment to the cua agent loop.

- benchmarks/src/cua_harbor: CuaHarborAgent (BaseAgent), loaded by Harbor via
  --agent-import-path; no harbor fork changes needed (import-path resolution).
- benchmarks/node: the cua-bench-task entrypoint that attaches to the Kernel
  session via browsers.retrieve, runs CuaAgentHarness, and emits the answer,
  screenshots, and an ATIF-mappable event log under /logs/agent.
- harbor[kernel] pulled as a uv git dependency; minimal example task + tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@rgarcia rgarcia changed the title from "Add packages/bench with an Online-Mind2Web benchmark runner" to "cua-harbor shared core: Kernel Harbor env <-> cua agent (benchmarks/)" Jun 26, 2026
@rgarcia rgarcia marked this pull request as ready for review June 27, 2026 13:36

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.

Bugbot Autofix prepared fixes for all 3 issues found in the latest run.

  • ✅ Fixed: First prompt lacks screenshot
    • The Node benchmark entrypoint now captures an initial browser screenshot and passes it via harness.prompt(..., { images }) on the first turn.
  • ✅ Fixed: Node failure treated as success
    • CuaHarborAgent.run now raises a RuntimeError when the Node subprocess exits non-zero so failed runs are surfaced as failures instead of continuing.
  • ✅ Fixed: Ignores harness error stopReason
    • The Node benchmark entrypoint now checks assistant.stopReason after harness.prompt and throws on error or aborted to fail the run.

Or push these changes by commenting:

@cursor push 90be33c080
Preview (90be33c080)
diff --git a/benchmarks/node/src/task.ts b/benchmarks/node/src/task.ts
--- a/benchmarks/node/src/task.ts
+++ b/benchmarks/node/src/task.ts
@@ -1,7 +1,7 @@
 import { writeFileSync } from "node:fs";
 import { join } from "node:path";
 import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent";
-import { type CuaModelRef, requireCuaEnvApiKeyForModel } from "@onkernel/cua-ai";
+import { type CuaModelRef, type ImageContent, requireCuaEnvApiKeyForModel } from "@onkernel/cua-ai";
 import Kernel from "@onkernel/sdk";
 import { extractFinalAnswer } from "./answer.ts";
 import { attachAtifSink, writeFinalLine, writeUserLine } from "./sink.ts";
@@ -34,7 +34,8 @@
   requireCuaEnvApiKeyForModel(model);
 
   const client = new Kernel({ apiKey: requireEnv("KERNEL_API_KEY") });
-  const browser = await client.browsers.retrieve(requireEnv("KERNEL_SESSION_ID"));
+  const kernelSessionId = requireEnv("KERNEL_SESSION_ID");
+  const browser = await client.browsers.retrieve(kernelSessionId);
   const session = await new InMemorySessionRepo().create({ id: taskId });
   const harness = new CuaAgentHarness({
     browser,
@@ -48,7 +49,11 @@
   writeUserLine(outDir, instruction);
   const unsubscribe = attachAtifSink({ harness, outDir });
   try {
-    await harness.prompt(instruction);
+    const images = await captureInitialScreenshot(client, kernelSessionId);
+    const assistant = await harness.prompt(instruction, images ? { images } : undefined);
+    if (assistant.stopReason === "error" || assistant.stopReason === "aborted") {
+      throw new Error(assistant.errorMessage ?? `agent stopped with ${assistant.stopReason}`);
+    }
   } finally {
     unsubscribe();
   }
@@ -59,6 +64,16 @@
   writeFinalLine(outDir, { answer, session_id: taskId, model, agent_version: CUA_AGENT_VERSION });
 }
 
+async function captureInitialScreenshot(client: Kernel, sessionId: string): Promise<ImageContent[] | undefined> {
+  try {
+    const screenshot = await client.browsers.computer.captureScreenshot(sessionId);
+    const image = Buffer.from(await screenshot.arrayBuffer()).toString("base64");
+    return [{ type: "image", data: image, mimeType: "image/png" }];
+  } catch {
+    return undefined;
+  }
+}
+
 main().catch((err) => {
   console.error(err instanceof Error ? err.stack ?? err.message : String(err));
   process.exit(1);

diff --git a/benchmarks/src/cua_harbor/agent.py b/benchmarks/src/cua_harbor/agent.py
--- a/benchmarks/src/cua_harbor/agent.py
+++ b/benchmarks/src/cua_harbor/agent.py
@@ -105,6 +105,9 @@
             self.logger.warning(
                 f"cua entrypoint exited with code {proc.returncode}; see {stderr_path}"
             )
+            raise RuntimeError(
+                f"cua entrypoint exited with code {proc.returncode}; see {stderr_path}"
+            )
 
         self._ensure_answer_file()
         # Populate context now so a later timeout still leaves token/cost metrics.

diff --git a/benchmarks/tests/test_agent.py b/benchmarks/tests/test_agent.py
--- a/benchmarks/tests/test_agent.py
+++ b/benchmarks/tests/test_agent.py
@@ -100,3 +100,21 @@
     agent = _make_agent(tmp_path)
     with pytest.raises(RuntimeError, match="Kernel session not started"):
         await agent.run("hi", EmptyEnv(), AgentContext())
+
+
+async def test_run_raises_when_node_entrypoint_fails(tmp_path, fake_env, monkeypatch):
+    class _FailedProc:
+        returncode = 17
+
+        async def wait(self) -> None:
+            return None
+
+    async def fake_exec(program, *args, env=None, stdout=None, stderr=None):
+        return _FailedProc()
+
+    monkeypatch.setattr("cua_harbor.agent.asyncio.create_subprocess_exec", fake_exec)
+    monkeypatch.setattr(CuaHarborAgent, "_bundle_path", lambda self: Path("/fake/task.js"))
+
+    agent = _make_agent(tmp_path)
+    with pytest.raises(RuntimeError, match="cua entrypoint exited with code 17"):
+        await agent.run("Go to example.com", fake_env, AgentContext())

writeUserLine(outDir, instruction);
const unsubscribe = attachAtifSink({ harness, outDir });
try {
  await harness.prompt(instruction);

First prompt lacks screenshot

Medium Severity

The Node entrypoint starts a fresh harness session with harness.prompt(instruction) and never passes { images }. For non-yutori providers the first turn is sent without the current browser frame, unlike the CLI’s maybeInitialScreenshot pattern, so models that rely on an initial screenshot (or observe-style prompts) can fail or waste a tool round even when Kernel already navigated to the start URL.
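The best-effort screenshot pattern this comment asks for can be sketched in Python for illustration. The real entrypoint is TypeScript, and initial_images and its capture callback are hypothetical names. Only the image shape (type, data, mimeType) mirrors the proposed fix.

```python
import base64
from typing import Callable, Optional


def initial_images(capture: Callable[[], bytes]) -> Optional[list[dict]]:
    """Return a one-element image list for the first prompt, or None on failure."""
    try:
        png = capture()  # hypothetical callback returning raw PNG bytes
    except Exception:
        # Best effort: a failed capture should not block the first turn.
        return None
    data = base64.b64encode(png).decode("ascii")
    return [{"type": "image", "data": data, "mimeType": "image/png"}]
```

A caller would pass the result along only when it is not None, so a failed capture degrades to the current text-only first turn.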

Triggered by learned rule: Harness prompt calls must attach first-prompt screenshot for non-yutori providers


self._ensure_answer_file()
# Populate context now so a later timeout still leaves token/cost metrics.
self.populate_context_post_run(context)

Node failure treated as success

Medium Severity

If the Node entrypoint exits non-zero, CuaHarborAgent.run logs a warning but continues execution without raising an exception. This allows _ensure_answer_file and populate_context_post_run to run, potentially leaving answer.txt empty or partial. This behavior can mask agent failures from Harbor, leading to misleading task outcomes and metrics.
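A standalone sketch of the failure-surfacing behavior requested here, assuming a plain asyncio subprocess. The helper names run_entrypoint and demo are hypothetical, and a Python child process stands in for the Node entrypoint so the example runs without Node installed.

```python
import asyncio
import sys


async def run_entrypoint(*cmd: str) -> None:
    """Run a child process and raise if it exits non-zero."""
    proc = await asyncio.create_subprocess_exec(*cmd)
    await proc.wait()
    if proc.returncode != 0:
        # Raising, rather than only logging, lets the caller see the run as failed.
        raise RuntimeError(f"cua entrypoint exited with code {proc.returncode}")


def demo() -> str:
    """Spawn a child that exits with code 17 and report what the caller sees."""
    try:
        asyncio.run(run_entrypoint(sys.executable, "-c", "import sys; sys.exit(17)"))
    except RuntimeError as err:
        return str(err)
    return "ok"
```

With this shape, later steps such as writing the answer file never run on a failed subprocess unless the caller explicitly catches the error.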

const branch = await session.getBranch();
const answer = extractFinalAnswer(branch);
writeFileSync(join(outDir, "answer.txt"), answer);
writeFinalLine(outDir, { answer, session_id: taskId, model, agent_version: CUA_AGENT_VERSION });

Ignores harness error stopReason

Medium Severity

After harness.prompt, the entrypoint never inspects the returned assistant message for stopReason of error or aborted. A failed agent turn can still write answer.txt, emit final in run.jsonl, and exit 0, so Harbor treats the agent phase as successful while grading may use an empty or misleading answer.
