Skip to content

[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5

Open
varunursekar wants to merge 3 commits into
harbor-2-sidecarfrom
harbor-3-compiler
Open

[Harbor 3/4] Mode B (nested harbor run) + the vero harbor build compiler#5
varunursekar wants to merge 3 commits into
harbor-2-sidecarfrom
harbor-3-compiler

Conversation

@varunursekar

@varunursekar varunursekar commented Jun 24, 2026

Copy link
Copy Markdown

Draft · Stack 3 of 4 — targets harbor-2-sidecar.

  • Mode B (runner.py): HarborRunner, an EvalStrategy that — per candidate — runs a nested harbor run of the agent over the selected Harbor tasks (e.g. on Modal) and collates the verifier rewards into vero SampleResults. One Harbor task = one sample; inference is delegated, scoring comes from Harbor.
  • The compiler (build/): vero harbor build renders a BuildConfig into a runnable Harbor task — a Docker Compose env (optimizer workbench + eval sidecar + 3 volumes), two Dockerfiles, instruction.md, tests/test.sh, seed/solve scripts — baking the dataset/scorer/baseline repo and the sidecar's ServeConfig. Mode A and Mode B.

Review focus: the nested-run + collation, and the trust-boundary plumbing baked into the generated compose (volumes, root:600 token, secrets → sidecar only). Also un-ignores src/vero/harbor/build/ (the repo's build/ rule was hiding the package).

Stack: [1/4] core → [2/4] sidecar → this → [4/4] docs.

🤖 Generated with Claude Code

Greptile Summary

This PR implements Mode B (HarborRunner) — a nested harbor run eval strategy that collates trial result.json files into SampleResults — and the vero harbor build compiler that renders a BuildConfig into a runnable Harbor task directory (Docker Compose environment, two Dockerfiles, templates, and baked dataset/baseline/vero source). The stack-2 issues flagged in previous reviews (git archive exit-code, mkdtemp leak, last-write-wins on retries, and silent n_attempts/max_retries drop) are all addressed in this revision.

  • HarborRunner (runner.py): nested harbor run execution with resume logic, best-trial selection via _trial_rank (clean+rewards > any rewards > latest timestamp), and gap-filling collation that converts missing trials to error SampleResults.
  • Compiler (build/compiler.py + templates): compile_task materialises BuildConfig into a deterministic task directory; trust-boundary wiring (secrets → sidecar only via ${VAR:?} compose guards, root:600 admin token, advisory read_only_paths in seed.sh) is clearly documented and tested.

Confidence Score: 5/5

Safe to merge; all previously-flagged defects are resolved and the remaining findings are minor edge-case gaps that do not affect the normal build or evaluation path.

All previously-flagged defects (git archive exit-code check, temp-dir leak, last-write-wins on retried tasks, silent n_attempts/max_retries drop) are resolved and covered by new tests in TestReviewFixes. The remaining findings are a missing Mode A task validation that fails loudly at sidecar startup, a TOML single-quote edge case requiring deliberately unusual formatting, and a shell-quoting gap in seed.sh.j2 unlikely with realistic file paths. None affect the happy path.

vero/src/vero/harbor/build/compiler.py for the Mode A task validation gap and the single-quote TOML regex; vero/src/vero/harbor/build/templates/seed.sh.j2 for the read_only_paths quoting.

Important Files Changed

Filename Overview
vero/src/vero/harbor/runner.py Mode B runner: _build_command now includes --n-attempts/--max-retries, _load_trials uses _trial_rank for deterministic best-trial selection, and _is_done correctly re-runs persisted error samples. Logic is sound and well-tested.
vero/src/vero/harbor/build/compiler.py Compiler for Harbor task dirs. Previously flagged issues (git archive returncode check, TemporaryDirectory cleanup) are fixed. Missing compile-time validation for config.task in Mode A and single-quoted TOML path handling in _rewrite_vero_source_path are minor new gaps.
vero/src/vero/harbor/build/config.py BuildConfig Pydantic model with sensible defaults; from_file resolves relative paths against the config directory. Well-structured.
vero/src/vero/harbor/build/templates/seed.sh.j2 Generates the seed script; trust-boundary advisory note is accurate and tested. read_only_paths values are interpolated into double-quoted shell strings without escaping.
vero/src/vero/harbor/build/templates/docker-compose.yaml.j2 Compose template correctly isolates secrets to the sidecar via ${VAR:?} fail-fast guards, mounts token volume read-only on main, and gates main startup on the sidecar health check.
vero/tests/test_harbor_build.py Good coverage: structure, ServeConfig round-trip, TOML/YAML parsing, vero path rewriting, shared baseline SHA, git archive failure, missing-secret gate, and advisory read-only documentation.
vero/tests/test_harbor_runner.py Thorough runner tests: command flags, reward extraction, collation with missing trials, resume skipping done samples, and the TestReviewFixes class verifying both previously-flagged issues are resolved.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Caller as EvalFramework
    participant HR as HarborRunner
    participant FS as Filesystem (jobs_dir)
    participant Harbor as harbor CLI (nested run)
    participant Sidecar as eval-sidecar (per task)
    participant Store as vero SampleResult store

    Caller->>HR: produce_sample_results(workspace, params, result_dir)
    HR->>HR: _task_names_for(params)
    HR->>Store: _is_done? (filter pending)
    HR->>Harbor: uv run harbor run --n-attempts N --max-retries R -i task1 ... --jobs-dir jobs_dir
    loop per task
        Harbor->>Sidecar: run task
        Sidecar-->>FS: write result.json
    end
    Harbor-->>HR: exit code (non-zero is non-fatal)
    HR->>FS: _load_trials (rglob, pick best by _trial_rank)
    loop per (sample_id, task_name)
        HR->>Store: _is_done? skip if succeeded
        HR->>HR: _sample_result(trial, sample_id)
        HR->>Store: save_sample_result
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Caller as EvalFramework
    participant HR as HarborRunner
    participant FS as Filesystem (jobs_dir)
    participant Harbor as harbor CLI (nested run)
    participant Sidecar as eval-sidecar (per task)
    participant Store as vero SampleResult store

    Caller->>HR: produce_sample_results(workspace, params, result_dir)
    HR->>HR: _task_names_for(params)
    HR->>Store: _is_done? (filter pending)
    HR->>Harbor: uv run harbor run --n-attempts N --max-retries R -i task1 ... --jobs-dir jobs_dir
    loop per task
        Harbor->>Sidecar: run task
        Sidecar-->>FS: write result.json
    end
    Harbor-->>HR: exit code (non-zero is non-fatal)
    HR->>FS: _load_trials (rglob, pick best by _trial_rank)
    loop per (sample_id, task_name)
        HR->>Store: _is_done? skip if succeeded
        HR->>HR: _sample_result(trial, sample_id)
        HR->>Store: save_sample_result
    end
Loading

Comments Outside Diff (2)

  1. vero/src/vero/harbor/runner.py, line 739-752 (link)

    P1 n_attempts and max_retries silently dropped from harbor run

    HarborConfig exposes n_attempts and max_retries as typed fields (not extra_args pass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in _build_command, so every harbor run invocation uses whatever the harbor defaults are regardless of what the caller configured. A user who sets max_retries=0 to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: vero/src/vero/harbor/runner.py
    Line: 739-752
    
    Comment:
    **`n_attempts` and `max_retries` silently dropped from harbor run**
    
    `HarborConfig` exposes `n_attempts` and `max_retries` as typed fields (not `extra_args` pass-throughs), signalling that callers expect to configure retry behavior. Neither field is translated to a CLI flag in `_build_command`, so every `harbor run` invocation uses whatever the harbor defaults are regardless of what the caller configured. A user who sets `max_retries=0` to disable retries on a budget-sensitive run will get the default retry behavior and may blow their budget.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

  2. vero/src/vero/harbor/runner.py, line 799-813 (link)

    P1 Last-write-wins on duplicate task_name when harbor retries a task

    rglob("result.json") visits all result files across all timestamp directories under jobs_dir. If harbor's n_attempts or max_retries causes the same task to produce multiple trial entries with the same task_name, trials[task_name] = data simply overwrites with whichever file rglob happens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: vero/src/vero/harbor/runner.py
    Line: 799-813
    
    Comment:
    **Last-write-wins on duplicate `task_name` when harbor retries a task**
    
    `rglob("result.json")` visits all result files across all timestamp directories under `jobs_dir`. If harbor's `n_attempts` or `max_retries` causes the same task to produce multiple trial entries with the same `task_name`, `trials[task_name] = data` simply overwrites with whichever file `rglob` happens to yield last — filesystem iteration order is not defined. This can cause a passing retry to be silently replaced by a failing one (or vice versa), yielding an incorrect score for that sample.
    
    How can I resolve this? If you propose a fix, please make it concise.

    Fix in Cursor Fix in Claude Code Fix in Codex

Reviews (2): Last reviewed commit: "Merge pull request #9 from scaleapi/harb..." | Re-trigger Greptile

- Mode B (runner.py): `HarborRunner`, an `EvalStrategy` that — for each candidate —
  runs a *nested* `harbor run` of the agent over the selected Harbor tasks (e.g. on
  Modal) and collates the verifier rewards into vero `SampleResult`s. One Harbor task
  = one sample; inference is fully delegated, scoring comes from Harbor's verifier.
- The compiler (build/): `vero harbor build` renders a `BuildConfig` into a runnable
  Harbor task directory — a Docker Compose environment (optimizer workbench `main` +
  the eval sidecar + three volumes), two Dockerfiles, instruction.md, tests/test.sh,
  and the seed/solve scripts — baking the dataset/scorer/baseline repo and the
  sidecar's ServeConfig. Supports Mode A (local dataset/scorer) and Mode B (a registry
  or local Harbor benchmark, passed through to the HarborConfig).
- `.gitignore`: un-ignore src/vero/harbor/build/ (the repo's `build/` rule was hiding
  the compiler package).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@varunursekar varunursekar requested a review from a team June 24, 2026 18:15
@varunursekar varunursekar marked this pull request as ready for review June 24, 2026 18:22
Comment thread vero/src/vero/harbor/build/compiler.py
Comment thread vero/src/vero/harbor/build/compiler.py
# Execute
# ------------------------------------------------------------------

def _build_command(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HarborConfig.n_attempts / max_retries are typed fields, which signals they are honored, but _build_command never emits flags for them — so every nested run uses harbor defaults. On a budget-graded run that is a real cost/correctness bug. Either map them to the real harbor run flags here, or drop the fields. (Confirms Greptile's P1.)

continue
task_name = data.get("task_name")
if task_name:
trials[task_name] = data

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trials[task_name] = data is last-write-wins over an unordered rglob; with retries or a reused jobs_dir, a failing attempt can clobber a passing one, feeding a wrong score into auto_best. Collect all matches and pick deterministically (latest mtime / highest attempt index), and/or namespace jobs_dir per attempt. (Confirms Greptile's P1.)

values = [float(v) for v in rewards.values()]
return sum(values) / len(values) if values else 0.0

def _existing(self, params: EvaluationParameters, sample_id: int) -> SampleResult | None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_existing treats a persisted error sample as 'done', so a sample whose nested run failed once is permanently skipped on resume and silently degrades the candidate's score. Either don't persist error samples, or make the pending/skip predicate treat error results as not-done.

Comment thread vero/src/vero/harbor/build/compiler.py Outdated
["tar", "xf", "-", "--strip-components", str(strip)],
cwd=dest, stdin=archive.stdout, check=True,
)
archive.wait()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

archive.wait()'s return code is discarded; if git archive fails, tar can still exit 0 on a truncated stream and you commit a near-empty baseline whose SHA gets baked into both images. Assert archive.wait() == 0 (and reap the Popen on the tar-failure path). (Confirms Greptile.)

environment:
VERO_HOME_DIR: "/opt/vero_home"
{% for secret in secrets %}
{{ secret }}: "${{ '{' }}{{ secret }}{{ '}' }}"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This renders SECRET: "${SECRET}", resolved from the run host's env, not anything the compiler validates. An unset var interpolates to empty and the sidecar comes up credential-less, failing every eval opaquely. Consider failing the build / emitting a required-secrets manifest when a declared secret is missing. Note test_rendered_files_parse only asserts the literal ${...} string, which hides this.

# ...except locked paths (e.g. the scorer): root-owned + unwritable.
if [ -e "/work/agent/{{ p }}" ]; then
chown -R root:root "/work/agent/{{ p }}"
chmod -R a-w "/work/agent/{{ p }}"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This chown root:root + chmod -R a-w applies to the working tree under /work/agent, but the verifier never executes the working tree: _transfer_commit fetches git blobs and the evaluator checks the commit out into a fresh temp copy. The agent can git add -f a modified scorer and commit it regardless of working-tree perms. So read_only_paths is not a tamper control — please document it as advisory-only and move scorer-provenance enforcement into the sidecar (see the Mode A scorer-provenance comment on #4).

…mkdtemp cleanup

Mode B runner + build-compiler robustness (review findings on PR #5):

- runner: emit --n-attempts / --max-retries (the typed HarborConfig fields were
  silently dropped); pick the best trial per task deterministically instead of
  last-write-wins over an unordered rglob; treat a persisted error sample as
  not-done so a transient failure is re-run on resume.
- compiler: assert git archive exited 0 (and reap it) so a truncated stream
  cannot bake a near-empty baseline; validate declared secrets at build time and
  render compose secrets with a fail-fast guard so an unset host var fails loudly
  instead of producing a credential-less sidecar.
- seed.sh: document that read_only_paths is advisory only and scorer provenance
  is enforced sidecar-side.
- compiler: stage the dataset in a cleaned-up TemporaryDirectory (Greptile:
  the mkdtemp scratch dir was leaking, datasets can be gigabytes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(harbor): nested-run flags/dedup/resume, archive + secret guards [fixes 3/4 compiler]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants