[Harbor 4/4] architecture docs, tutorial, and the GAIA example by varunursekar · Pull Request #6 · scaleapi/vero

varunursekar · 2026-06-24T18:12:47Z

Draft · Stack 4 of 4 — targets harbor-3-compiler. Additive, low-risk.

- docs/harbor/architecture.md: what it is, the compiled-task topology, the two modes, the component map, and the leaderboard-integrity model.
- docs/harbor/tutorial.md: build + run end to end (both modes, the agent-side protocol); README Harbor section.
- examples/gaia-optimization: a Mode-B example optimizing a GaiaAgent (thin Terminus2 subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Start here for the big picture, then dive into [1/4]–[3/4].

Stack: [1/4] core → [2/4] sidecar → [3/4] compiler → this.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the final layer of the Harbor integration stack: architecture and tutorial documentation plus a complete Mode B runnable example (gaia-optimization) that optimizes a GaiaAgent prompt on the GAIA benchmark via a nested harbor run on Modal.

- Docs (docs/harbor/architecture.md, tutorial.md): cover the compiled-task topology, both evaluation modes, the leaderboard-integrity trust boundary (including the documented fail-open default for unlisted splits), and end-to-end build/run instructions for both modes.
- GAIA example (examples/gaia-optimization): a self-contained build.yaml + thin Terminus2 subclass that redirects prompt-template resolution to an editable prompts/ directory, which is the optimization surface. The build.yaml correctly sets no_access on the held-out validation split and includes a clear caveat about held-out task IDs being readable from the git-tracked agent_repo (acceptable for this public benchmark, with guidance for private benchmarks).

Confidence Score: 5/5

Purely additive: documentation files and a self-contained example with no changes to library code or existing paths.

All changed files are new (docs and example); no existing code is modified. The one Python file (agent.py) is a minimal 38-line Terminus2 subclass with straightforward logic and no side effects on the rest of the codebase.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/docs/harbor/architecture.md | New architecture doc covering the compiled-task topology, both evaluation modes, the leaderboard-integrity trust boundary, and the component map; accurately documents the known fail-open default for unlisted splits. |
| vero/docs/harbor/tutorial.md | New tutorial with install steps, Mode A / Mode B build.yaml examples, build/run commands, and the agent-side protocol; internally consistent with the architecture doc and example. |
| vero/examples/gaia-optimization/build.yaml | Well-structured Mode B build config with correct split access levels, budget, and reward target; includes a clear caveat comment about held-out task IDs being visible in the git-tracked agent_repo. |
| vero/examples/gaia-optimization/src/gaia_agent/agent.py | Minimal Terminus2 subclass redirecting prompt template resolution to the editable prompts/ directory; correctly falls back to super() for unknown parser names and contains no hardcoded secrets. |
| vero/examples/gaia-optimization/pyproject.toml | Clean project definition; force-include for prompts/ ensures the editable prompt files are packaged into the wheel alongside the Python module. |
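
The prompt-redirect pattern described for agent.py can be sketched roughly as below. This is an illustration only, not the code in the PR: `BaseAgent`, `template_path`, the parser names, and the file names are all hypothetical stand-ins, since the actual Terminus2 interface is not shown on this page.

```python
from pathlib import Path


# Hypothetical stand-in for the Terminus2 base class; the real API may differ.
class BaseAgent:
    DEFAULT_DIR = Path("/opt/base/templates")

    def __init__(self, parser_name: str) -> None:
        self.parser_name = parser_name

    def template_path(self) -> Path:
        # Default resolution: templates bundled with the base agent.
        return self.DEFAULT_DIR / f"{self.parser_name}.txt"


class GaiaAgentSketch(BaseAgent):
    # Editable directory that the optimizer is allowed to modify (assumed layout).
    PROMPTS_DIR = Path("prompts")
    # Parser names this subclass knows how to redirect (assumed names).
    KNOWN_TEMPLATES = {"json": "terminus-json.txt", "xml": "terminus-xml.txt"}

    def template_path(self) -> Path:
        name = self.KNOWN_TEMPLATES.get(self.parser_name)
        if name is None:
            # Unknown parser: defer to the base class instead of guessing.
            return super().template_path()
        return self.PROMPTS_DIR / name
```

The design point is that only the lookup location changes; anything the subclass does not recognize keeps the base-class behavior.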

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["vero harbor build -c build.yaml"] --> B["Harbor task dir\n(environment/, instruction.md, tests/test.sh)"]
    B --> C["harbor run -p task -a optimizer -e docker"]

    C --> D["main container\n(optimizer agent)\nedits prompts/, commits"]
    C --> E["eval-sidecar container\nvero harbor serve\n(budget ledger, admin token)"]

    D -- "vero harbor eval --split train" --> E
    E -- "nested harbor run (Modal)" --> F["GaiaAgent runs GAIA tasks\n(inner harbor environment)"]
    F -- "verifier rewards collate" --> E
    E -- "aggregate score only\n(no per-sample labels)" --> D

    D -- "trial ends" --> G["tests/test.sh\n(shared verifier, root)"]
    G -- "vero harbor finalize\n(admin token, root:600)" --> E
    E --> H["Select best train commit\nScore on hidden validation split"]
    H --> I["reward.json\naccuracy on validation"]
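
The last two nodes of the flowchart (select the best train commit, then score it on the hidden validation split) can be sketched as follows. This assumes "best" means the highest aggregate train score; `TrainEval`, `select_best_commit`, `finalize`, and the `score_validation` callback are hypothetical names, not vero's API.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TrainEval:
    commit: str         # commit the optimizer asked the sidecar to evaluate
    train_score: float  # aggregate score on the train split


def select_best_commit(evals: list[TrainEval]) -> str:
    """Return the commit with the highest train score (first one wins on ties)."""
    if not evals:
        raise ValueError("no train evaluations recorded")
    return max(evals, key=lambda e: e.train_score).commit


def finalize(evals: list[TrainEval], score_validation: Callable[[str], float]) -> str:
    """Score only the selected commit on validation and render a reward payload."""
    best = select_best_commit(evals)
    return json.dumps({"accuracy": score_validation(best)})
```

Selection uses train scores only, so the validation split is touched once, after the optimizer has finished.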

Reviews (2): Last reviewed commit: "Merge pull request #10 from scaleapi/har..."

- docs/harbor/architecture.md: what the integration is, the compiled-task topology, the two evaluation modes, the component map, and the leaderboard-integrity model.
- docs/harbor/tutorial.md: build and run an optimization task end to end (both modes, the agent-side protocol), and a Harbor section in the README.
- examples/gaia-optimization: a Mode-B example optimizing a GaiaAgent (a thin Terminus2 subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

shehabyasser-scale · 2026-06-30T08:34:44Z

+
+This lets anyone optimize a coding agent with plain `harbor run`, and makes the result
+leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or
+bypass its budget.


This reads as a hard guarantee ('the optimizer cannot read hidden labels, modify the scorer, or bypass its budget'), but the code makes each best-effort and the shipped GAIA example undercuts the first one (see the build.yaml comment). Suggest softening to something like: 'vero never writes per-sample labels to the agent's volume and meters every agent evaluation; OS-level mechanisms (read-only paths, a root:600 finalize token) keep the scorer and test split out of the agent's reach on a best-effort basis.'

shehabyasser-scale · 2026-06-30T08:34:51Z

+
+splits:
+  - { split: train, access: non_viewable }   # optimizer sees aggregate scores only
+  - { split: validation, access: no_access }  # hidden; never reaches the optimizer


'hidden; never reaches the optimizer' isn't true for this config: build.yaml is git-tracked and agent_repo is ., so vero harbor build seeds this whole file — including these validation task IDs — into /work/agent via git archive HEAD. The optimizer can read the held-out task IDs, and GAIA answers are public. Move the partition out of the agent_repo subtree, or caveat that for public benchmarks the held-out identity is visible (only per-sample scores are withheld). This is the example that backs the headline 'cannot read hidden labels' claim, so it is worth getting airtight.
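
The leak described in this comment comes down to a path check: a file is seeded into the optimizer's workspace if it is git-tracked and sits under agent_repo. A minimal sketch of that check, assuming paths are given relative to the repository root (for example, as printed by `git ls-files`); the function name `is_seeded_into_agent` is hypothetical:

```python
from pathlib import PurePosixPath
from typing import Iterable


def is_seeded_into_agent(tracked: Iterable[str], agent_repo: str, candidate: str) -> bool:
    """True if `candidate` is tracked and lies inside the `agent_repo` subtree."""
    root = PurePosixPath(agent_repo)   # "." normalizes to an empty parts tuple
    cand = PurePosixPath(candidate)
    tracked_paths = {PurePosixPath(p) for p in tracked}
    inside_root = cand.parts[: len(root.parts)] == root.parts
    return cand in tracked_paths and inside_root


# With agent_repo ".", every tracked file is inside the subtree, build.yaml included.
print(is_seeded_into_agent(["build.yaml", "src/gaia_agent/agent.py"], ".", "build.yaml"))
# Moving the agent into its own subdirectory keeps a root-level build.yaml out.
print(is_seeded_into_agent(["build.yaml", "agent/src/agent.py"], "agent", "build.yaml"))
```

A check like this only covers file placement; it says nothing about whether the held-out identities are discoverable some other way.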

shehabyasser-scale · 2026-06-30T08:35:00Z

+
+- [`docs/harbor/architecture.md`](docs/harbor/architecture.md) — what it is, the topology, and the leaderboard-integrity model.
+- [`docs/harbor/tutorial.md`](docs/harbor/tutorial.md) — build and run a task end to end.
+- [`examples/gsm8k-agent`](examples/gsm8k-agent) (Mode A) and [`examples/gaia-optimization`](examples/gaia-optimization) (Mode B).


examples/gsm8k-agent is cited as the Mode A example but it has no build.yaml (it's the older Policy-API example). The Harbor Mode A example that ships a build.yaml is examples/doubler-agent. Repoint here, or add a build.yaml to gsm8k-agent.

shehabyasser-scale · 2026-06-30T08:35:06Z

+The optimizer is untrusted. Integrity rests on a few mechanisms, all best-effort at
+the OS/process level (a container escape is out of scope):
+
+- **3-tier split visibility** (`SplitAccessLevel`): `visible` (aggregate + per-sample


Worth one explicit line here: tier_for_split defaults any split not listed to viewable (full per-sample results), so omission fails open. Tell authors to list every split explicitly. (Pairs with the protocol.py fail-open comment on #4.)
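
To make the fail-open concern concrete, here is a small sketch contrasting the two defaults. It is not vero's implementation: the `Access` enum and all three functions are hypothetical, and only the tier names (`viewable`, `non_viewable`, `no_access`) are taken from this thread.

```python
from enum import Enum


class Access(Enum):
    VIEWABLE = "viewable"          # full per-sample results
    NON_VIEWABLE = "non_viewable"  # aggregate scores only
    NO_ACCESS = "no_access"        # hidden from the optimizer


def tier_fail_open(configured: dict[str, Access], split: str) -> Access:
    # Behavior described in the comment: an unlisted split becomes viewable.
    return configured.get(split, Access.VIEWABLE)


def tier_fail_closed(configured: dict[str, Access], split: str) -> Access:
    # Safer alternative: an unlisted split is treated as hidden.
    return configured.get(split, Access.NO_ACCESS)


def unlisted_splits(configured: dict[str, Access], all_splits: list[str]) -> list[str]:
    # Lint helper: splits that exist in the dataset but are missing from the config.
    return sorted(set(all_splits) - set(configured))


config = {"train": Access.NON_VIEWABLE, "validation": Access.NO_ACCESS}
print(tier_fail_open(config, "test"))    # an omitted "test" split is fully exposed
print(tier_fail_closed(config, "test"))  # an omitted "test" split stays hidden
print(unlisted_splits(config, ["train", "validation", "test"]))
```

Whatever the default, a lint such as `unlisted_splits` at build time would surface the omission to the author instead of resolving it silently.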

shehabyasser-scale · 2026-06-30T08:35:12Z

+- **Commit transfer**: the sidecar `git fetch`es the agent's commit from the mounted
+  repo into its *own* repo with hooks disabled and `file://` (object copy, no
+  alternates), so the evaluated tree is fully owned by the sidecar and tamper-evident.
+- **Protected scorer / write-access**: the scorer is sidecar-only; `read_only_paths`


'the scorer is sidecar-only' holds for Mode B but not Mode A, where the scorer lives in the agent's editable repo, protected only by chown root:root + chmod -R a-w on read_only_paths (which isn't a real tamper control — see #5). Recommend splitting this claim by mode.

…xample, by-mode scorer

Documentation accuracy fixes (review findings on PR #6):

- architecture: soften the intro from a hard guarantee ("the optimizer cannot read hidden labels, modify the scorer, or bypass its budget") to best-effort, OS/process-level language describing what is actually enforced.
- gaia build.yaml: correct "never reaches the optimizer". Because agent_repo is "." and build.yaml is git-tracked, the validation task ids ARE seeded into the optimizer's repo; only the per-sample scores are withheld. Acceptable for a public benchmark, with a caveat + mitigations for secret-identity benchmarks.
- examples: gsm8k-agent is cited as the Mode A example but ships no build.yaml; repoint to gaia-optimization as the complete runnable example and pair gsm8k-agent with the tutorial's Mode A snippet.
- architecture: document the current fail-open default for unlisted splits (and that it becomes fail-closed once the protocol fix lands), and split the "scorer is sidecar-only" claim by mode (true for Mode B; Mode A keeps the scorer in the agent's editable repo until the serve.py fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(harbor): honest integrity guarantees, GAIA leak caveat, Mode A example [fixes 4/4 docs]

varunursekar mentioned this pull request Jun 24, 2026

Add Harbor integration: optimization-as-a-Harbor-task #2

Closed

varunursekar requested a review from a team June 24, 2026 18:18

varunursekar marked this pull request as ready for review June 24, 2026 18:22

greptile-apps Bot reviewed Jun 24, 2026

Comment thread vero/examples/gaia-optimization/src/gaia_agent/agent.py

shehabyasser-scale reviewed Jun 30, 2026

shehabyasser-scale mentioned this pull request Jun 30, 2026

docs(harbor): honest integrity guarantees, GAIA leak caveat, Mode A example [fixes 4/4 docs] #10

Merged

Merge pull request #10 from scaleapi/harbor-4-docs-fixes

bb04d67

docs(harbor): honest integrity guarantees, GAIA leak caveat, Mode A example [fixes 4/4 docs]

varunursekar wants to merge 3 commits into harbor-3-compiler from harbor-4-docs

2 participants
