
[Harbor 4/4] architecture docs, tutorial, and the GAIA example #6

Open
varunursekar wants to merge 3 commits into harbor-3-compiler from harbor-4-docs

Conversation

@varunursekar varunursekar commented Jun 24, 2026

Draft · Stack 4 of 4 — targets harbor-3-compiler. Additive, low-risk.

  • docs/harbor/architecture.md — what it is, the compiled-task topology, the two modes, the component map, and the leaderboard-integrity model.
  • docs/harbor/tutorial.md — build + run end to end (both modes, the agent-side protocol); README Harbor section.
  • examples/gaia-optimization — a Mode-B example optimizing a GaiaAgent (thin Terminus2 subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Start here for the big picture, then dive into [1/4]–[3/4].

Stack: [1/4] core → [2/4] sidecar → [3/4] compiler → this.

🤖 Generated with Claude Code

Greptile Summary

This PR adds the final layer of the Harbor integration stack: architecture and tutorial documentation plus a complete Mode B runnable example (gaia-optimization) that optimizes a GaiaAgent prompt on the GAIA benchmark via a nested harbor run on Modal.

  • Docs (docs/harbor/architecture.md, tutorial.md): cover the compiled-task topology, both evaluation modes, the leaderboard-integrity trust boundary (including the documented fail-open default for unlisted splits), and end-to-end build/run instructions for both modes.
  • GAIA example (examples/gaia-optimization): a self-contained build.yaml + thin Terminus2 subclass that redirects prompt-template resolution to an editable prompts/ directory — the optimization surface. The build.yaml correctly sets no_access on the held-out validation split and includes a clear caveat about held-out task IDs being readable from the git-tracked agent_repo (acceptable for this public benchmark, with guidance for private benchmarks).

Confidence Score: 5/5

Purely additive: documentation files and a self-contained example with no changes to library code or existing paths.

All changed files are new (docs and example); no existing code is modified. The one Python file (agent.py) is a minimal 38-line Terminus2 subclass with straightforward logic and no side effects on the rest of the codebase.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| vero/docs/harbor/architecture.md | New architecture doc covering the compiled-task topology, both evaluation modes, the leaderboard-integrity trust boundary, and the component map; accurately documents the known fail-open default for unlisted splits. |
| vero/docs/harbor/tutorial.md | New tutorial with install steps, Mode A / Mode B build.yaml examples, build/run commands, and the agent-side protocol; internally consistent with the architecture doc and example. |
| vero/examples/gaia-optimization/build.yaml | Well-structured Mode B build config with correct split access levels, budget, and reward target; includes a clear caveat comment about held-out task IDs being visible in the git-tracked agent_repo. |
| vero/examples/gaia-optimization/src/gaia_agent/agent.py | Minimal Terminus2 subclass redirecting prompt template resolution to the editable prompts/ directory; correctly falls back to super() for unknown parser names and contains no hardcoded secrets. |
| vero/examples/gaia-optimization/pyproject.toml | Clean project definition; force-include for prompts/ ensures the editable prompt files are packaged into the wheel alongside the Python module. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["vero harbor build -c build.yaml"] --> B["Harbor task dir\n(environment/, instruction.md, tests/test.sh)"]
    B --> C["harbor run -p task -a optimizer -e docker"]

    C --> D["main container\n(optimizer agent)\nedits prompts/, commits"]
    C --> E["eval-sidecar container\nvero harbor serve\n(budget ledger, admin token)"]

    D -- "vero harbor eval --split train" --> E
    E -- "nested harbor run (Modal)" --> F["GaiaAgent runs GAIA tasks\n(inner harbor environment)"]
    F -- "verifier rewards collate" --> E
    E -- "aggregate score only\n(no per-sample labels)" --> D

    D -- "trial ends" --> G["tests/test.sh\n(shared verifier, root)"]
    G -- "vero harbor finalize\n(admin token, root:600)" --> E
    E --> H["Select best train commit\nScore on hidden validation split"]
    H --> I["reward.json\naccuracy on validation"]
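The last two nodes of the flowchart (pick the best train commit, then score it on the hidden validation split) can be read as a two-stage selection. The following is a minimal Python sketch of that idea, not the actual vero implementation: the names `select_best_commit` and `finalize`, the shape of the score mapping, and the tie-breaking rule are all assumptions made for illustration.

```python
from typing import Callable, Mapping


def select_best_commit(train_scores: Mapping[str, float]) -> str:
    """Return the commit with the highest aggregate train score.

    Ties go to the commit that appears first in the mapping. That rule is
    an assumption of this sketch; the PR does not specify one.
    """
    if not train_scores:
        raise ValueError("no evaluated commits to choose from")
    return max(train_scores, key=train_scores.__getitem__)


def finalize(
    train_scores: Mapping[str, float],
    score_on_validation: Callable[[str], float],
) -> dict:
    """Score only the selected commit on the held-out split.

    `score_on_validation` is a hypothetical callback standing in for the
    sidecar's hidden-split evaluation.
    """
    best = select_best_commit(train_scores)
    return {"commit": best, "accuracy": score_on_validation(best)}
```

The point of the split is that the optimizer only ever influences which commit is selected; the validation score is computed once, after the trial ends.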

Reviews (2): Last reviewed commit: "Merge pull request #10 from scaleapi/har..."

- docs/harbor/architecture.md — what the integration is, the compiled-task topology,
  the two evaluation modes, the component map, and the leaderboard-integrity model.
- docs/harbor/tutorial.md — build and run an optimization task end to end (both modes,
  the agent-side protocol), and a Harbor section in the README.
- examples/gaia-optimization — a Mode-B example optimizing a GaiaAgent (a thin Terminus2
  subclass with an editable prompt) on gaia/gaia via a nested harbor run on Modal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@varunursekar varunursekar requested a review from a team June 24, 2026 18:18
@varunursekar varunursekar marked this pull request as ready for review June 24, 2026 18:22
Comment thread vero/examples/gaia-optimization/src/gaia_agent/agent.py
Comment thread vero/docs/harbor/architecture.md Outdated

This lets anyone optimize a coding agent with plain `harbor run`, and makes the result
leaderboard-gradeable — the optimizer cannot read hidden labels, modify the scorer, or
bypass its budget.

Collaborator


This reads as a hard guarantee ('the optimizer cannot read hidden labels, modify the scorer, or bypass its budget'), but the code makes each best-effort and the shipped GAIA example undercuts the first one (see the build.yaml comment). Suggest softening to something like: 'vero never writes per-sample labels to the agent's volume and meters every agent evaluation; OS-level mechanisms (read-only paths, a root:600 finalize token) keep the scorer and test split out of the agent's reach on a best-effort basis.'
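To make "meters every agent evaluation" concrete, here is a minimal sketch of a budget ledger of the kind the sidecar is described as keeping. It is illustrative only: the class name `BudgetLedger`, the `BudgetExhausted` exception, and the choice of "samples" as the budget unit are assumptions, not vero's API.

```python
class BudgetExhausted(RuntimeError):
    """Raised when an evaluation would exceed the remaining budget."""


class BudgetLedger:
    """Hypothetical ledger that charges each evaluation before it runs."""

    def __init__(self, total: int) -> None:
        if total < 0:
            raise ValueError("total budget must be non-negative")
        self.total = total
        self.spent = 0

    @property
    def remaining(self) -> int:
        return self.total - self.spent

    def charge(self, n_samples: int) -> int:
        """Deduct `n_samples` and return what is left.

        The request is rejected as a whole if it does not fit, so a
        failed charge never changes the ledger.
        """
        if n_samples <= 0:
            raise ValueError("n_samples must be positive")
        if n_samples > self.remaining:
            raise BudgetExhausted(
                f"requested {n_samples}, only {self.remaining} left"
            )
        self.spent += n_samples
        return self.remaining
```

Metering of this kind limits how many evaluations the agent can request; it does not by itself protect the ledger from tampering, which is why the comment above separates metering from the OS-level mechanisms.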


splits:
- { split: train, access: non_viewable } # optimizer sees aggregate scores only
- { split: validation, access: no_access } # hidden; never reaches the optimizer

Collaborator


'hidden; never reaches the optimizer' isn't true for this config: `build.yaml` is git-tracked and `agent_repo` is `.`, so `vero harbor build` seeds this whole file, including these validation task IDs, into `/work/agent` via `git archive HEAD`. The optimizer can read the held-out task IDs, and GAIA answers are public. Move the partition out of the `agent_repo` subtree, or caveat that for public benchmarks the held-out identity is visible (only per-sample scores are withheld). This is the example that backs the headline 'cannot read hidden labels' claim, so it is worth getting airtight.
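A quick way to reason about the leak described above is to ask which tracked files fall under the `agent_repo` subtree, since those are the ones an archive of that subtree would carry into the optimizer's working copy. The sketch below assumes paths are given relative to the repository root (as `git ls-files` prints them when run from the root); the function name `seeded_into_agent` and the example file names are hypothetical.

```python
from pathlib import PurePosixPath


def seeded_into_agent(tracked: list[str], agent_repo: str = ".") -> list[str]:
    """Return the tracked paths that live under `agent_repo`.

    With agent_repo == "." every tracked file qualifies, which is the
    situation the review comment describes.
    """
    root = PurePosixPath(agent_repo)
    if root == PurePosixPath("."):
        return list(tracked)
    seeded = []
    for path in tracked:
        parts = PurePosixPath(path).parts
        # A path is under `root` when its leading components match.
        if parts[: len(root.parts)] == root.parts:
            seeded.append(path)
    return seeded
```

For example, with a hypothetical layout `["build.yaml", "agent/src/agent.py"]`, setting `agent_repo` to `agent` keeps `build.yaml` (and any task IDs listed in it) out of the seeded set, while `.` includes it.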

Comment thread vero/README.md Outdated

- [`docs/harbor/architecture.md`](docs/harbor/architecture.md) — what it is, the topology, and the leaderboard-integrity model.
- [`docs/harbor/tutorial.md`](docs/harbor/tutorial.md) — build and run a task end to end.
- [`examples/gsm8k-agent`](examples/gsm8k-agent) (Mode A) and [`examples/gaia-optimization`](examples/gaia-optimization) (Mode B).

Collaborator


examples/gsm8k-agent is cited as the Mode A example but it has no build.yaml (it's the older Policy-API example). The Harbor Mode A example that ships a build.yaml is examples/doubler-agent. Repoint here, or add a build.yaml to gsm8k-agent.

The optimizer is untrusted. Integrity rests on a few mechanisms, all best-effort at
the OS/process level (a container escape is out of scope):

- **3-tier split visibility** (`SplitAccessLevel`): `visible` (aggregate + per-sample

Collaborator


Worth one explicit line here: `tier_for_split` defaults any split not listed to `viewable` (full per-sample results), so omission fails open. Tell authors to list every split explicitly. (Pairs with the `protocol.py` fail-open comment on #4.)
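The difference between failing open and failing closed can be shown in a few lines. This is a sketch under stated assumptions, not the code in `protocol.py`: the enum members mirror the access strings that appear in this thread (`viewable`, `non_viewable`, `no_access`), and `lookup_tier` and `unlisted_splits` are hypothetical helpers with made-up signatures.

```python
from enum import Enum


class SplitAccessLevel(str, Enum):
    VIEWABLE = "viewable"          # aggregate and per-sample results
    NON_VIEWABLE = "non_viewable"  # aggregate scores only
    NO_ACCESS = "no_access"        # nothing reaches the optimizer


def lookup_tier(
    configured: dict[str, SplitAccessLevel],
    split: str,
    fail_closed: bool = False,
) -> SplitAccessLevel:
    """Resolve a split's tier, choosing the default for unlisted splits.

    fail_closed=False models the behavior the comment above describes
    (an omitted split is fully viewable); fail_closed=True models the
    stricter alternative.
    """
    if split in configured:
        return configured[split]
    return SplitAccessLevel.NO_ACCESS if fail_closed else SplitAccessLevel.VIEWABLE


def unlisted_splits(
    configured: dict[str, SplitAccessLevel], all_splits: list[str]
) -> list[str]:
    """Return splits that would silently fall back to the default."""
    return [s for s in all_splits if s not in configured]
```

A build-time check along the lines of `unlisted_splits` would let authors catch an omitted split before the default ever applies.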

Comment thread vero/docs/harbor/architecture.md Outdated
- **Commit transfer**: the sidecar `git fetch`es the agent's commit from the mounted
repo into its *own* repo with hooks disabled and `file://` (object copy, no
alternates), so the evaluated tree is fully owned by the sidecar and tamper-evident.
- **Protected scorer / write-access**: the scorer is sidecar-only; `read_only_paths`

Collaborator


'the scorer is sidecar-only' holds for Mode B but not Mode A, where the scorer lives in the agent's editable repo, protected only by `chown root:root` + `chmod -R a-w` on `read_only_paths` (which isn't a real tamper control; see #5). Recommend splitting this claim by mode.
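For the commit-transfer bullet quoted above, the shape of such a fetch can be sketched as an argument list. This is an assumption-laden illustration rather than what the sidecar actually runs: `build_fetch_command` is a hypothetical helper, pointing `core.hooksPath` at `/dev/null` is one common way to disable hooks for a single git invocation, and fetching by ref name (instead of a raw commit hash) is a simplification of this sketch.

```python
from pathlib import Path


def build_fetch_command(sidecar_repo: Path, agent_repo: Path, ref: str) -> list[str]:
    """Build a `git fetch` argv that copies objects into the sidecar's repo.

    The source is addressed with a file:// URL so git uses its regular
    object transport rather than treating it as a plain local path.
    """
    return [
        "git",
        "-C", str(sidecar_repo),           # operate inside the sidecar's own repo
        "-c", "core.hooksPath=/dev/null",  # no hooks for this invocation
        "fetch",
        "--no-tags",
        agent_repo.resolve().as_uri(),     # file:///... URL of the agent's repo
        ref,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building it separately keeps the command easy to inspect and log.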

…xample, by-mode scorer

Documentation accuracy fixes (review findings on PR #6):

- architecture: soften the intro from a hard guarantee ("the optimizer cannot
  read hidden labels, modify the scorer, or bypass its budget") to best-effort,
  OS/process-level language describing what is actually enforced.
- gaia build.yaml: correct "never reaches the optimizer". Because agent_repo is
  "." and build.yaml is git-tracked, the validation task ids ARE seeded into the
  optimizer's repo; only the per-sample scores are withheld. Acceptable for a
  public benchmark, with a caveat + mitigations for secret-identity benchmarks.
- examples: gsm8k-agent is cited as the Mode A example but ships no build.yaml;
  repoint to gaia-optimization as the complete runnable example and pair
  gsm8k-agent with the tutorial's Mode A snippet.
- architecture: document the current fail-open default for unlisted splits (and
  that it becomes fail-closed once the protocol fix lands), and split the
  "scorer is sidecar-only" claim by mode (true for Mode B; Mode A keeps the
  scorer in the agent's editable repo until the serve.py fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
docs(harbor): honest integrity guarantees, GAIA leak caveat, Mode A example [fixes 4/4 docs]