WebVoyager Harbor adapter (Kernel env + cua agent)#42
Draft
rgarcia wants to merge 6 commits into
Draft
Conversation
Generates WebVoyager's 643 live-web tasks (15 sites) as Harbor task dirs that run on the Kernel environment via the shared cua_harbor agent. Each record becomes instruction.md + environment/kernel.json (start_url + stealth + 1280x1024) + a per-task ground_truth.json; the dataset is vendored and pinned to upstream commit 0915445 for hermetic generation. The verifier ports WebVoyager's single multimodal judge (SYSTEM_PROMPT verbatim from upstream auto_eval.py) to the Anthropic Messages API: it reads /logs/agent/answer.txt + the last-k /logs/agent/shots/shot-<n>.png the agent spilled and writes a 0/1 reward (SUCCESS/NOT SUCCESS, ambiguous fails closed). Site names with spaces are slugified so [task].name matches ORG_NAME_PATTERN, and reference answers with stray control chars are escaped for valid TOML. Generated task dirs and caches are gitignored. Mocked unit tests + ruff green.
The Kernel verifier VM has Python 3 but no pip/ensurepip, so the judge cannot install the anthropic SDK at grade time. Call the Messages API directly with urllib.request instead; drop the install step from test.sh and point docs at bare python3 for generation. Also gitignore _smoke_logs/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Run the WebVoyager adapter end to end on Kernel browsers with cua as the agent and the Anthropic WebJudge as the verifier: 20 tasks, pass rate 10/20 (10/17 of graded tasks), 3 agent timeouts on heavy/anti-bot sites, no adapter bugs. SMOKE.md captures the per-task table and the observed failure taxonomy. Make the judge resilient across model generations and transient API failures: retry once without `temperature` when a model rejects it with a 400 (newer models do), and fail closed to reward 0 with the error recorded in grading_details.json instead of crashing a trial into a missing reward.
The mid-run snapshot under-counted exceptions; the final summary is 5 (4 AgentTimeoutError + 1 AddTestsDirError). Headline Mean 0.500 (10/20) unchanged.
Replace the SMOKE notes with the claude-opus-4-8 agent + opus-4-8 judge run: 14/20 pass over 20 curated tasks across 12 sites, 0 judge/adapter exceptions. Failure taxonomy: 1 anti-bot (Cloudflare), 3 screenshot-coverage false-negatives (the MAX_IMAGES tension), 1 agent timeout (multi-constraint faceted search), and 1 env/session-lifetime error (session deleted before the shared-session verifier could attach). The judge hardening this run validated (temperature-drop retry + fail-closed on HTTP error) is already on the branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… auto-eval The canonical WebVoyager auto-eval invocation (evaluation/run_eval.sh + README) runs the GPT-4V judge with --max_attached_imgs 15; our default of 3 was read from auto_eval.py's argparse default, which is never what produces the published numbers. With one screenshot spilled per agent step, the last-k window is the only place the deciding frame can land, so k=3 left correct answers unverifiable and produced screenshot-coverage false-negatives. Set the default to 15 in task.toml and webjudge.py (env override preserved) and fix the README/run-config notes that quoted the old default. A live re-run at k=15 recovers the SMOKE false-negatives (apple--2, huggingface--2 both 0 -> 1). Adds PARITY.md documenting the applied fix vs the deliberate Kernel adaptations left intact. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
|
Parity pass vs |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds the WebVoyager benchmark as a Harbor adapter that runs on the Kernel
environment with cua as the agent. Stacks on the cua-harbor shared core (#40).
benchmarks/adapters/webvoyager/— Python task generator (adapter.py+main.py) that emits Harbor task dirs from the vendored WebVoyager dataset(643 tasks, pinned under
src/webvoyager/data/), plus a stdlib-only verifier(
webjudge.py) that ports WebVoyager's single multimodal judge to theAnthropic Messages API (no pip needed in the verifier VM).
retries once without
temperaturewhen a model rejects it with a 400, andfails closed to reward 0 with the error recorded in
grading_details.jsonrather than crashing a trial into a missing reward.
Live smoke (recorded in
SMOKE.md)Ran the full pipeline live — 20 tasks across 13 sites,
-n 8with browserpools,
claude-sonnet-4-6for both the agent and the judge, 900s/task:AgentTimeoutErroron heavy/anti-bot sites (Amazon, Apple, Booking,Allrecipes). No adapter bugs — browser provision, agent drive,
answer/screenshot spill, in-VM judge, and reward write all fired on every
task.
budget + a residential proxy for a parity run), and judge strictness on
textually-correct but visually-unconfirmed answers.
Test plan
uv run pytest adapters/webvoyager/tests— 25 passed (generation +judge parse/retry/error-handling, network stubbed)
uv run ruff checkclean🤖 Generated with Claude Code