fix(adjudicator): refute exploitable verdicts with no evidence anchor + clarify runtime/secret evidence in the prompt #130

Merged
thejefflarson merged 1 commit into main from fix/guard-unsupported-exploitable on Jun 30, 2026

@thejefflarson

Why

An internet-facing watcher-server Pod came back exploitable with reason "connects to exposed secrets which are mounted into the pod…" — a false breach. Its evidence: CVEs (none), no exposed secret baked into the image, runtime behavior = three benign NetworkConnections to its own DB/metrics; all objectives [MOUNTED] (own creds) or [NETWORK] [same-ns] (own DB). Correct verdict: refuted. The 1B judge fabricated evidence by (a) treating benign network connections as a live signal and (b) conflating reaching a secret/… objective with an exposed secret in the image.

There was already a guard_fabricated_cve backstop; this adds the symmetric zero-anchor one for unsupported exploitable.

The guard

guard_unsupported_exploitable (in guards.rs, mirroring guard_fabricated_cve's shape via the shared guard_exploitable gate) downgrades an Exploitable verdict to Refuted ONLY when ALL THREE exploitation anchors are absent:

  • the CVE evidence list is empty (no CVE shown to the model), AND
  • there is no exposed-secret finding for the entry, AND
  • no observed behavior is corroborating.

"Corroborating runtime behavior" reuses the engine's existing definition — Behavior::is_alert() (a critical Falco alert) OR exec_class::notable_exec(&behavior).is_some() (a notable shell/pkg-manager exec, JEF-117). Benign NetworkConnection/FileRead/LibraryLoaded/SecretRead are not corroborating and never anchor an exploitable.

Conservative by design: if any anchor is present — a CVE in the list (even reachability:not-observed), an exposed secret, or a corroborating behavior — the model's (debatable) call stands untouched. This is purely the zero-anchor safety net. Like the fabrication guard it only ever acts on Exploitable; the entry is re-judged next pass.
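The "corroborating runtime behavior" test can be sketched roughly as below. This is a minimal illustration, not the engine's code: the `Behavior` variants and their fields, the set of binaries treated as notable, and the free function `is_corroborating` are all assumptions made for the example. Only the names `is_alert` and `notable_exec` come from the description above.

```rust
// Hypothetical stand-in for the engine's Behavior type; variant fields are assumed.
#[derive(Debug, Clone, PartialEq)]
enum Behavior {
    NetworkConnection { dest: String },
    FileRead { path: String },
    LibraryLoaded { name: String },
    SecretRead { name: String },
    Exec { command: String },
    Alert { rule: String },
}

impl Behavior {
    /// True only for alert-class behaviors (the description says: a critical Falco alert).
    fn is_alert(&self) -> bool {
        matches!(self, Behavior::Alert { .. })
    }
}

/// Assumed classification: an exec is notable when its binary is a shell or a
/// package manager. The real exec_class::notable_exec may use different rules.
fn notable_exec(b: &Behavior) -> Option<&'static str> {
    let Behavior::Exec { command } = b else {
        return None;
    };
    // Take the first token of the command line, then strip any directory prefix.
    let bin = command.split_whitespace().next()?.rsplit('/').next()?;
    match bin {
        "sh" | "bash" | "zsh" => Some("shell"),
        "apt" | "apt-get" | "apk" | "yum" | "pip" => Some("package-manager"),
        _ => None,
    }
}

/// A behavior anchors an exploitable verdict only if it is an alert or a notable exec.
/// Network connections, file reads, library loads and secret reads never qualify.
fn is_corroborating(b: &Behavior) -> bool {
    b.is_alert() || notable_exec(b).is_some()
}
```

Under these assumptions, the watcher's three `NetworkConnection` behaviors all return `false`, so the runtime anchor is absent for that entry.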

Exposed-secret presence is read from the same source the prompt uses: entry_findings(graph, entry) returns (secret_lines, posture_lines); a non-empty secret_lines means a usable credential is baked into the image (posture/RBAC is not an anchor). Wired in model_call.rs, chained after guard_fabricated_cve.
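For reviewers who want the gate in one place, here is a minimal sketch of the zero-anchor rule. The `Anchors` struct, its field names, and the simplified signature are hypothetical; the real guard in guards.rs goes through the shared guard_exploitable gate and reads the graph directly, which is not reproduced here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    Exploitable,
    Refuted,
    Confirmed,
    Uncertain,
}

/// Hypothetical bundle of the three exploitation anchors for one entry.
struct Anchors<'a> {
    /// CVEs that were shown to the model (may be empty).
    cve_evidence: &'a [String],
    /// Exposed-secret lines, i.e. the first element of what entry_findings returns.
    secret_lines: &'a [String],
    /// Whether any observed behavior is an alert or a notable exec.
    corroborating_behavior: bool,
}

impl Anchors<'_> {
    /// True if at least one anchor is present.
    fn any_present(&self) -> bool {
        !self.cve_evidence.is_empty()
            || !self.secret_lines.is_empty()
            || self.corroborating_behavior
    }
}

/// Downgrade Exploitable to Refuted only when all three anchors are absent.
/// Every other verdict, and every anchored Exploitable, passes through unchanged.
fn guard_unsupported_exploitable(verdict: Verdict, anchors: &Anchors) -> Verdict {
    match verdict {
        Verdict::Exploitable if !anchors.any_present() => Verdict::Refuted,
        other => other,
    }
}
```

The match arm makes the conservative intent explicit: the guard has a single rewrite path, and a CVE in the list is enough to leave the model's call alone, whatever its reachability.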

Prompt clarifications

Two surgical additions (existing structure/wording preserved):

  1. Runtime-behavior bullet: a workload's OWN observed activity (outbound network connections, file reads, library loads, reading its own mounted secrets) is normal behavior and NOT a live signal — only an ALERT or hands-on-keyboard action counts.
  2. Secrets bullet: reaching a secret/… objective (a Credential-Access OUTCOME in the reachable-objectives list) is NOT an exposed secret baked into the image — only a credential in the "Exposed secrets baked into this image" field is exploitation evidence.

Fingerprint shift: changing the prompt string deterministically shifts the verdict-cache fingerprint inputs at the prompt level, so entries re-judge once. Expected. No code-level snapshot pins the prompt text; the only test affected was the prompt-size bound (raised from 4,000 to 5,000 to account for the larger static template — the assertion still proves the untrusted-payload cap, since a megabyte title would blow past it by orders of magnitude).
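To illustrate why a prompt edit forces a one-time re-judge, the sketch below hashes the prompt template together with the evidence. The function name `fingerprint`, its inputs, and the use of the standard library's `DefaultHasher` are assumptions for the example; the engine's actual fingerprint inputs and hash function are not shown in this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical cache key: any change to the prompt template or the evidence
/// produces a different value, so the cached verdict no longer matches.
fn fingerprint(prompt_template: &str, evidence: &[&str]) -> u64 {
    // DefaultHasher::new() uses fixed keys, so equal inputs hash equally within
    // one build. Its algorithm is not guaranteed stable across Rust releases, so
    // a cache that persists across upgrades would want an explicitly pinned hash.
    let mut hasher = DefaultHasher::new();
    prompt_template.hash(&mut hasher);
    evidence.hash(&mut hasher);
    hasher.finish()
}
```

With identical evidence, swapping the old template for the clarified one yields a new fingerprint once; after that pass the value is stable again and the cache resumes hitting.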

Tests

  • guard fires: Exploitable + empty CVEs + no exposed secret + only benign behaviors (the watcher case + misc benign) → Refuted.
  • guard preserves the verdict in each anchored case: a CVE present, an exposed-secret finding present, a corroborating alert, and a notable exec.
  • guard leaves non-Exploitable verdicts (Refuted/Confirmed/Uncertain) untouched.
  • two prompt-content assertions for the clarifications.

All existing adjudicate tests kept green.

Gates (from engine/)

cargo fmt · cargo build · cargo clippy --all-targets (clean, warnings = errors) · cargo test: 353 passed, 0 failed, 1 ignored (the e2e test needing PROTECTOR_E2E_MODEL). File-size guard green.

Closes JEF-watcher-false-breach.

🤖 Generated with Claude Code

… + clarify runtime/secret evidence in the prompt

An internet-facing watcher-server Pod came back `exploitable` ("connects to
exposed secrets which are mounted into the pod…") — a false breach. Its evidence:
CVEs (none), no exposed secret baked into the image, runtime = three benign
NetworkConnections to its own DB/metrics. The 1B judge fabricated evidence by
treating benign connections as a live signal and conflating reaching a secret/…
objective with an exposed secret in the image. Correct verdict: refuted.

Add the symmetric backstop to guard_fabricated_cve: guard_unsupported_exploitable
downgrades an Exploitable verdict to Refuted ONLY when ALL THREE exploitation
anchors are absent — empty CVE list, no exposed-secret finding, and no
corroborating runtime behavior (Behavior::is_alert() or exec_class::notable_exec,
the engine's existing definition; benign Network/File/Library/SecretRead are NOT
corroborating). Any anchor present leaves the model's call untouched. Wired after
guard_fabricated_cve in model_call; exposed-secret presence read from the same
entry_findings source the prompt uses.

Also two surgical prompt clarifications: a workload's own activity (network
connections, file reads, library loads, reading its own mounted secrets) is NOT a
live signal — only an ALERT or hands-on-keyboard action is; and reaching a
secret/… objective is NOT an exposed secret baked into the image. This shifts the
verdict fingerprint, so entries re-judge once.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VtjoJttCvBY4dzCoE4f9vP
@thejefflarson thejefflarson merged commit f8dd4b4 into main Jun 30, 2026
4 of 5 checks passed
@thejefflarson thejefflarson deleted the fix/guard-unsupported-exploitable branch June 30, 2026 03:08