cua-harbor shared core: Kernel Harbor env <-> cua agent (benchmarks/) #40
rgarcia wants to merge 2 commits into
Introduces @onkernel/cua-bench, an extensible web-agent benchmark runner that drives the cua-agent loop against Kernel cloud browsers and grades trajectories with a configurable LLM judge. Online-Mind2Web is the first benchmark, graded by a ported WebJudge; a registry interface lets more benchmarks drop in. Wires the package into the npm workspace, build chain, and tsconfig project references.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the standalone packages/bench with ./benchmarks: the reusable core connecting Harbor's Kernel environment to the cua agent loop.
- benchmarks/src/cua_harbor: CuaHarborAgent (BaseAgent), loaded by Harbor via --agent-import-path; no harbor fork changes needed (import-path resolution).
- benchmarks/node: the cua-bench-task entrypoint that attaches to the Kernel session via browsers.retrieve, runs CuaAgentHarness, and emits the answer, screenshots, and an ATIF-mappable event log under /logs/agent.
- harbor[kernel] pulled as a uv git dependency; minimal example task + tests.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
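For readers unfamiliar with import-path loading, here is a minimal sketch of how a "module:Attribute" string such as cua_harbor:CuaHarborAgent can be resolved at runtime. This is not Harbor's actual implementation; the helper name load_from_import_path is hypothetical and only illustrates the general technique with the standard library.

```python
import importlib
from typing import Any


def load_from_import_path(import_path: str) -> Any:
    """Resolve a 'package.module:Attribute' string to the named object.

    Hypothetical helper for illustration; Harbor's own resolver may differ.
    """
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:Attribute', got {import_path!r}")
    # Import the module by its dotted name, then look up the attribute on it.
    module = importlib.import_module(module_name)
    try:
        return getattr(module, attr_name)
    except AttributeError as exc:
        raise ImportError(f"{module_name!r} has no attribute {attr_name!r}") from exc


if __name__ == "__main__":
    # A stdlib target is used so the sketch runs without cua_harbor installed.
    ordered_dict_cls = load_from_import_path("collections:OrderedDict")
    print(ordered_dict_cls.__name__)
```

Because the class is looked up by name at runtime, the host project needs no registry entry or enum value for the agent, which matches the "no harbor fork changes needed" point above.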
Cursor Bugbot has reviewed your changes using high effort and found 3 potential issues.
Bugbot Autofix prepared fixes for all 3 issues found in the latest run.
- ✅ Fixed: First prompt lacks screenshot
  - The Node benchmark entrypoint now captures an initial browser screenshot and passes it via harness.prompt(..., { images }) on the first turn.
- ✅ Fixed: Node failure treated as success
  - CuaHarborAgent.run now raises a RuntimeError when the Node subprocess exits non-zero so failed runs are surfaced as failures instead of continuing.
- ✅ Fixed: Ignores harness error stopReason
  - The Node benchmark entrypoint now checks assistant.stopReason after harness.prompt and throws on error or aborted to fail the run.
Or push these changes by commenting:
@cursor push 90be33c080
Preview (90be33c080)
diff --git a/benchmarks/node/src/task.ts b/benchmarks/node/src/task.ts
--- a/benchmarks/node/src/task.ts
+++ b/benchmarks/node/src/task.ts
@@ -1,7 +1,7 @@
 import { writeFileSync } from "node:fs";
 import { join } from "node:path";
 import { CuaAgentHarness, InMemorySessionRepo, NodeExecutionEnv } from "@onkernel/cua-agent";
-import { type CuaModelRef, requireCuaEnvApiKeyForModel } from "@onkernel/cua-ai";
+import { type CuaModelRef, type ImageContent, requireCuaEnvApiKeyForModel } from "@onkernel/cua-ai";
 import Kernel from "@onkernel/sdk";
 import { extractFinalAnswer } from "./answer.ts";
 import { attachAtifSink, writeFinalLine, writeUserLine } from "./sink.ts";
@@ -34,7 +34,8 @@
   requireCuaEnvApiKeyForModel(model);
   const client = new Kernel({ apiKey: requireEnv("KERNEL_API_KEY") });
-  const browser = await client.browsers.retrieve(requireEnv("KERNEL_SESSION_ID"));
+  const kernelSessionId = requireEnv("KERNEL_SESSION_ID");
+  const browser = await client.browsers.retrieve(kernelSessionId);
   const session = await new InMemorySessionRepo().create({ id: taskId });
   const harness = new CuaAgentHarness({
     browser,
@@ -48,7 +49,11 @@
   writeUserLine(outDir, instruction);
   const unsubscribe = attachAtifSink({ harness, outDir });
   try {
-    await harness.prompt(instruction);
+    const images = await captureInitialScreenshot(client, kernelSessionId);
+    const assistant = await harness.prompt(instruction, images ? { images } : undefined);
+    if (assistant.stopReason === "error" || assistant.stopReason === "aborted") {
+      throw new Error(assistant.errorMessage ?? `agent stopped with ${assistant.stopReason}`);
+    }
   } finally {
     unsubscribe();
   }
@@ -59,6 +64,16 @@
   writeFinalLine(outDir, { answer, session_id: taskId, model, agent_version: CUA_AGENT_VERSION });
 }
+async function captureInitialScreenshot(client: Kernel, sessionId: string): Promise<ImageContent[] | undefined> {
+  try {
+    const screenshot = await client.browsers.computer.captureScreenshot(sessionId);
+    const image = Buffer.from(await screenshot.arrayBuffer()).toString("base64");
+    return [{ type: "image", data: image, mimeType: "image/png" }];
+  } catch {
+    return undefined;
+  }
+}
+
 main().catch((err) => {
   console.error(err instanceof Error ? err.stack ?? err.message : String(err));
   process.exit(1);
diff --git a/benchmarks/src/cua_harbor/agent.py b/benchmarks/src/cua_harbor/agent.py
--- a/benchmarks/src/cua_harbor/agent.py
+++ b/benchmarks/src/cua_harbor/agent.py
@@ -105,6 +105,9 @@
             self.logger.warning(
                 f"cua entrypoint exited with code {proc.returncode}; see {stderr_path}"
             )
+            raise RuntimeError(
+                f"cua entrypoint exited with code {proc.returncode}; see {stderr_path}"
+            )
         self._ensure_answer_file()
         # Populate context now so a later timeout still leaves token/cost metrics.
diff --git a/benchmarks/tests/test_agent.py b/benchmarks/tests/test_agent.py
--- a/benchmarks/tests/test_agent.py
+++ b/benchmarks/tests/test_agent.py
@@ -100,3 +100,21 @@
     agent = _make_agent(tmp_path)
     with pytest.raises(RuntimeError, match="Kernel session not started"):
         await agent.run("hi", EmptyEnv(), AgentContext())
+
+
+async def test_run_raises_when_node_entrypoint_fails(tmp_path, fake_env, monkeypatch):
+    class _FailedProc:
+        returncode = 17
+
+        async def wait(self) -> None:
+            return None
+
+    async def fake_exec(program, *args, env=None, stdout=None, stderr=None):
+        return _FailedProc()
+
+    monkeypatch.setattr("cua_harbor.agent.asyncio.create_subprocess_exec", fake_exec)
+    monkeypatch.setattr(CuaHarborAgent, "_bundle_path", lambda self: Path("/fake/task.js"))
+
+    agent = _make_agent(tmp_path)
+    with pytest.raises(RuntimeError, match="cua entrypoint exited with code 17"):
+        await agent.run("Go to example.com", fake_env, AgentContext())
Reviewed by Cursor Bugbot for commit 455ac06. Configure here.
  writeUserLine(outDir, instruction);
  const unsubscribe = attachAtifSink({ harness, outDir });
  try {
    await harness.prompt(instruction);
First prompt lacks screenshot
Medium Severity
The Node entrypoint starts a fresh harness session with harness.prompt(instruction) and never passes { images }. For non-yutori providers the first turn is sent without the current browser frame, unlike the CLI’s maybeInitialScreenshot pattern, so models that rely on an initial screenshot (or observe-style prompts) can fail or waste a tool round even when Kernel already navigated to the start URL.
Triggered by learned rule: Harness prompt calls must attach first-prompt screenshot for non-yutori providers
        self._ensure_answer_file()
        # Populate context now so a later timeout still leaves token/cost metrics.
        self.populate_context_post_run(context)
Node failure treated as success
Medium Severity
If the Node entrypoint exits non-zero, CuaHarborAgent.run logs a warning but continues execution without raising an exception. This allows _ensure_answer_file and populate_context_post_run to run, potentially leaving answer.txt empty or partial. This behavior can mask agent failures from Harbor, leading to misleading task outcomes and metrics.
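A minimal, self-contained sketch of the fail-fast behavior this comment asks for: wait on a child process and raise when it exits non-zero, so nothing after it runs against partial output. The function name run_entrypoint and the Python child process standing in for the Node entrypoint are assumptions for illustration; this is not the code in CuaHarborAgent.run.

```python
import asyncio
import sys


async def run_entrypoint(*argv: str) -> None:
    """Run a child process and raise if it exits with a non-zero code.

    Hypothetical helper; it only demonstrates the fail-fast pattern.
    """
    proc = await asyncio.create_subprocess_exec(*argv)
    returncode = await proc.wait()
    if returncode != 0:
        # Raising here stops any post-run steps from treating the run as a success.
        raise RuntimeError(f"entrypoint exited with code {returncode}")


async def main() -> str:
    try:
        # Stand-in child that always fails with exit code 17.
        await run_entrypoint(sys.executable, "-c", "import sys; sys.exit(17)")
    except RuntimeError as exc:
        return str(exc)
    return "ok"


if __name__ == "__main__":
    print(asyncio.run(main()))  # prints "entrypoint exited with code 17"
```

Raising instead of only logging lets the caller decide how to record the failure, rather than silently proceeding to write answer and metrics files.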
  const branch = await session.getBranch();
  const answer = extractFinalAnswer(branch);
  writeFileSync(join(outDir, "answer.txt"), answer);
  writeFinalLine(outDir, { answer, session_id: taskId, model, agent_version: CUA_AGENT_VERSION });
Ignores harness error stopReason
Medium Severity
After harness.prompt, the entrypoint never inspects the returned assistant message for stopReason of error or aborted. A failed agent turn can still write answer.txt, emit final in run.jsonl, and exit 0, so Harbor treats the agent phase as successful while grading may use an empty or misleading answer.



Summary
Repurposes this PR from the standalone Online-Mind2Web runner to the reusable shared core that connects Harbor's Kernel environment to the cua agent loop. Everything lives in kernel/cua under a new top-level ./benchmarks/; kernel/harbor is not modified, since the agent is loaded by import path.
What's here
- benchmarks/src/cua_harbor/: CuaHarborAgent (BaseAgent), loaded via --agent-import-path cua_harbor:CuaHarborAgent (Harbor's AgentFactory.create_agent_from_import_path, so no enum/factory/fork edit). It spawns the Node entrypoint on the host, plumbs the Kernel session id/key + model + provider key, writes /logs/agent/answer.txt, and maps the run log to an ATIF trajectory.
- benchmarks/node/: the cua-bench-task entrypoint. It attaches to the Kernel session via client.browsers.retrieve(KERNEL_SESSION_ID) (never creates/deletes it), runs CuaAgentHarness, and emits the answer + per-step screenshots + an ATIF-mappable run.jsonl. Depends on the published @onkernel/* packages.
- harbor[kernel] pulled as a git dependency from the fork branch that carries the Kernel env (hypeship/kernel-environment).
- Minimal example task (examples/tasks/cua-hello) + Node/Python tests.
Design
The Kernel environment lives in kernel/harbor (a candidate for upstream contribution) and is used at runtime via -e kernel. Only the Kernel-specific agent + (later) benchmark adapters live here.
Verification
- benchmarks/node: build + tsc typecheck + vitest green.
- benchmarks: uv sync (resolves harbor[kernel] from the git fork branch) + ruff + pytest green (Python tests mock the Node subprocess).
- uv run harbor run -p examples/tasks/cua-hello -e kernel --agent-import-path cua_harbor:CuaHarborAgent -m anthropic/claude-opus-4-8 → reward 1.0, 0 exceptions (real Kernel browser, opus-4-8 drives example.com, verifier checks the answer). The live run exposed a missing harbor[kernel] extra, since fixed, that the mocked tests did not catch.
Scope / next
Shared core only. The three benchmark adapters (Online-Mind2Web, WebVoyager, ClawBench) land as later branches off this one, each adding benchmarks/adapters/<name>/ (a Python task generator + verifier) on top of this core.
Note
Medium Risk
New benchmark integration that forwards API keys and Kernel session credentials into a host Node subprocess and pins Harbor to a git fork branch; scope is isolated to benchmarks/ but runtime depends on external services and keys.
Overview
Adds a new benchmarks/ package so the cua computer-use agent can run as a Harbor agent on the Kernel browser environment without changing Harbor itself; load it via --agent-import-path cua_harbor:CuaHarborAgent.
CuaHarborAgent runs on the host: it reads KERNEL_SESSION_ID/KERNEL_API_KEY from the environment, maps Harbor provider/name to cua provider:name, forwards provider keys from --ae, and spawns the bundled Node task.js. The entrypoint attaches to the existing Kernel session (browsers.retrieve), runs CuaAgentHarness, and writes /logs/agent artifacts: answer.txt, shots/, and run.jsonl. Python then maps run.jsonl to ATIF trajectory.json and fills AgentContext token/cost fields.
Also includes the cua-hello smoke task (example.com heading), docs, harbor[kernel] as a git dependency on the Kernel-env branch, wheel packaging of built node/dist, and Node/Python tests (subprocess mocked in pytest).
Bugbot is set up for automated code reviews on this repo.
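The host-side plumbing described in the overview can be sketched roughly as follows: convert a Harbor-style provider/name model string to the provider:name form, then assemble the environment handed to the Node subprocess. This is an illustration under stated assumptions, not the code in cua_harbor: the helper names to_cua_model_ref and build_child_env and the CUA_MODEL variable are hypothetical, while KERNEL_SESSION_ID and KERNEL_API_KEY come from the description above.

```python
from typing import Mapping


def to_cua_model_ref(harbor_model: str) -> str:
    """Map a 'provider/name' model string to 'provider:name'."""
    provider, sep, name = harbor_model.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/name', got {harbor_model!r}")
    return f"{provider}:{name}"


def build_child_env(
    base_env: Mapping[str, str],
    harbor_model: str,
    provider_keys: Mapping[str, str],
) -> dict[str, str]:
    """Assemble the environment for the Node entrypoint (hypothetical helper)."""
    required = ("KERNEL_SESSION_ID", "KERNEL_API_KEY")
    missing = [key for key in required if not base_env.get(key)]
    if missing:
        raise RuntimeError(f"missing required environment: {', '.join(missing)}")
    env = dict(base_env)
    # CUA_MODEL is an assumed variable name, used here only for illustration.
    env["CUA_MODEL"] = to_cua_model_ref(harbor_model)
    # Provider API keys are forwarded as-is to the child process.
    env.update(provider_keys)
    return env


if __name__ == "__main__":
    child_env = build_child_env(
        {"KERNEL_SESSION_ID": "session-123", "KERNEL_API_KEY": "kernel-key"},
        "anthropic/claude-opus-4-8",
        {"ANTHROPIC_API_KEY": "provider-key"},
    )
    print(child_env["CUA_MODEL"])  # prints "anthropic:claude-opus-4-8"
```

Validating the Kernel variables before spawning keeps a missing session from surfacing later as an opaque Node failure.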