Solving the Claude reliability problem.
An AI agent that confidently reports work it never verified is worse than useless — it makes you the quality-control layer. This is the project record: the goal, the real failure log, the research it rests on, the voices that review the work, and the rules, gates and code that fixed it.
01 — The goal
Have the agent produce a finished, on-brand deliverable — and be able to rely on its "done." Not "done" as a feeling. "Done" as a claim backed by evidence anyone can audit.
Outcome ≠ transcript
The model's report of what it did is a separate act from the doing — and nothing in the base loop forces them to agree. A flight agent says "booked"; the only truth is whether a row exists in the database. Ours: the agent says "the video looks great" — the only truth is what plays on screen.
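The flight-agent case can be made concrete with a small sketch: the agent's report is treated as an unverified claim, and the only thing consulted is the stored outcome. The `bookings` table, its `id` column, and the `claim_is_backed` helper are hypothetical names chosen for illustration, not part of any real system described here.

```python
import sqlite3


def claim_is_backed(conn: sqlite3.Connection, booking_id: str) -> bool:
    """Return True only if the claimed booking exists as a row."""
    row = conn.execute(
        "SELECT 1 FROM bookings WHERE id = ?", (booking_id,)
    ).fetchone()
    return row is not None


# In-memory database standing in for the real system of record (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO bookings (id) VALUES ('ABC123')")

# The transcript: what the agent says happened.
agent_report = {"status": "booked", "booking_id": "XYZ999"}

# The outcome: what the database says happened.
outcome_ok = claim_is_backed(conn, agent_report["booking_id"])
print(agent_report["status"], "-> backed by a row:", outcome_ok)
```

The report says "booked" either way; only the query distinguishes a real booking from a narrated one.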
02 — The thinking
The pattern in the log was not random incompetence. It was one shape repeating: the agent narrated the intended result of an action as if it had verified it. Sign-off from intent, not inspection. So the fix could not be "try harder" — a discipline you can skip is not a control. It had to be structural: make the cheap honest path (open the file, run the check) the only path to "done," and put a human at the two gates where judgement was being faked.
The point isn't that the agent never errs. It's that an error can no longer reach you dressed as "done."
03 — The research
This is the central, documented limitation of agentic systems — not a local quirk. The fixes below are drawn from the literature, then encoded as gates.
The mechanisms, tagged by how much they can be trusted
| Mechanism | Kills | Trust |
|---|---|---|
| Artifact-exists + media-integrity probe | false completion | robust |
| Required-content string check (OCR) | "text's all there" | robust |
| Layout-defect measurement (clip, overlap, blank, hex) | visual-verify gap | robust |
| Cross-model vision pass (different model) | self-preference | partial |
| Per-criterion rubric, position-shuffled | unreliable self-eval | partial |
| Human approval interrupt | subjective taste | robust |
| Generator declaring "looks good" from intent | nothing | theatre |
| Same-model self-critique / debate | nothing | theatre |
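Two of the rows tagged robust are simple enough to sketch. The code below is a minimal illustration, assuming the OCR step has already run elsewhere and handed back plain text; the function names and the `min_bytes` threshold are hypothetical. It does not cover the media-integrity probe, which would need an external tool.

```python
from pathlib import Path


def artifact_exists(path: str, min_bytes: int = 1) -> bool:
    """True only if the file is present and not trivially empty."""
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes


def missing_strings(extracted_text: str, required: list[str]) -> list[str]:
    """Return the required strings that do not appear in the extracted text.

    Whitespace runs are collapsed and case is ignored, so line breaks
    introduced by OCR do not cause false alarms.
    """
    haystack = " ".join(extracted_text.split()).casefold()
    return [
        s for s in required
        if " ".join(s.split()).casefold() not in haystack
    ]


# Example: text as it might come back from OCR of a rendered frame.
ocr_text = "Hello  World\nBook now"
print(missing_strings(ocr_text, ["hello world", "Sign up"]))
```

An empty list from `missing_strings` is the receipt for "text's all there"; anything else names exactly what is absent.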
04 — The failures
50 logged over ~7 weeks. Grouped by root cause, straight from the log: Deploy/infra 10, False claim 9, Assumed 8, Process 8, Other 8, Destructive 4, Visual 3.
The full log — 50 entries, nothing hidden
| Date | Category | Logged as |
|---|---|---|
| 2026-06-15 | Deploy/infra | FAIL-HYPERADAR-V2-OAUTH-STARTTIME-401 |
| 2026-06-15 | Assumed | FAIL-CREDENTIAL-ASSUMED-MISSING |
| 2026-06-10 | Deploy/infra | FAIL-PASTE-DEPLOY |
| 2026-06-08 | Assumed | FAIL-SSH-DENIED-FROM-STALE-README |
| 2026-06-08 | Destructive | FAIL-BLEED-DECLARED-STOPPED-BEFORE-KILLING-INFLIGHT |
| 2026-06-05 | Assumed | FAIL-MARY-RETIRED-MISREAD |
| 2026-06-04 | Assumed | FAIL-SCOTTY-BLOCKED-FROM-MISREAD-NOT-PROBE |
| 2026-06-03 | Process | FAIL-ASSLICKING-REGISTER |
| 2026-06-03 | Visual | FAIL-DOP-CARD-FILLED-FROM-INTENT-NOT-LOOKING |
| 2026-06-03 | False claim | FAIL-BACKUP-STRATEGY-IS-FICTION |
| 2026-06-03 | False claim | FAIL-MINI-GAP-FRAMED-AS-NEVER-WORKED |
| 2026-06-02 | Process | FAIL-ASKUSERQUESTION-DIED-WENT-SILENT |
| 2026-06-02 | Deploy/infra | FAIL-DEPLOY-HANDED-PASTE-BLOCK |
| 2026-06-02 | Assumed | FAIL-ALLOWLIST-UNNECESSARY-ASK |
| 2026-06-01 | False claim | FAIL-UNREACHABLE-ASSERTION |
| 2026-05-24 | Deploy/infra | FAIL-DEPLOY-TAILSCALE-SILENT-HANG |
| 2026-06-01 | Assumed | FAIL-CANON-BLIND-MEETINGS |
| 2026-06-01 | False claim | FAIL-PHANTOM-URL-BOOKING-LINKS |
| 2026-06-01 | Deploy/infra | FAIL-CANONICAL-ALIAS-DRIFT |
| 2026-06-01 | Deploy/infra | FAIL-LOGODEV-TOKEN-DEAD-404 |
| 2026-06-02 | Assumed | FAIL-STALE-CANON-OVER-MIGRATION-LOG |
| 2026-06-02 | Process | FAIL-RULE11-SILENT-ON-TOOL-ERROR |
| 2026-06-02 | Assumed | FAIL-DISKFULL-FALSE-CANT-RECLAIM |
| 2026-06-02 | Other | FAIL-BOOTSTRAP-SHARED-LOG-EACCES |
| 2026-06-02 | Other | FAIL-LAUNCHD-TCC-VOLUMES-UNREADABLE |
| 2026-06-03 | Other | FAIL-DRUMBEAT-MISSED-CYBER |
| 2026-06-03 | False claim | FAIL-STUDIOSERVER-STALE-HEADLINE-PANIC |
| 2026-06-03 | Deploy/infra | FAIL-DEPLOY-EQUALSFIVE-NO-CRED-RAIL |
| 2026-06-04 | False claim | FAIL-SESSIONS-DISKFULL-IS-A-PHANTOM-BLOCKER |
| 2026-06-04 | Destructive | FAIL-FIRED-RSYNC-BLIND-NO-PREFLIGHT |
| 2026-06-05 | Other | FAIL-MEDIAROUNDUP-WRONG-PROPOSAL |
| 2026-06-05 | Process | FAIL-REVIEWABLE-SURFACE-AS-FILE-CARD |
| 2026-06-05 | False claim | FAIL-STUDIOSERVER-STATUS-OMITTED-MIRROR |
| 2026-06-05 | Visual | FAIL-IMAGE-SIGNED-OFF-FROM-INTENT-x5 |
| 2026-06-08 | Destructive | FAIL-PLEX-MUSIC-WIPE |
| 2026-06-08 | Destructive | FAIL-MIRROR-WIPE |
| 2026-06-09 | Deploy/infra | FAIL-PASTE-DEPLOY-VERCEL-PAT-MISSING |
| 2026-06-10 | Process | FAIL-30SEC-VERBOSE-INVOICE-SESSION |
| 2026-06-10 | Process | FAIL-SILENT-SUBAGENT-RUN |
| 2026-06-10 | Process | FAIL-VERBOSE-x2 |
| 2026-06-10 | Deploy/infra | FAIL-CWDEPLOY-SANDBOX-PATH+CEILING |
| 2026-06-10 | Process | FAIL-INTERNAL-EXTERNAL-SURFACE-MIX |
| 2026-06-14 | Other | FAIL-DAISY-UNKNOWN |
| 2026-06-15 | Other | FAIL-HYPERADAR-API-TIMELINE-LOSSY |
| 2026-06-15 | Other | FAIL-RELATIVE-ASSET-LINK-LABS |
| 2026-06-19 | Other | FAIL-HYPERADAR-FORYOU-SLICE-AND-TOOLS-ONLY-FILTER |
| 2026-06-19 | False claim | FAIL-HYPERADAR-6H-SILENTLY-DEAD-SINCE-0615 + FAIL-CLAIMED-NO-FEED-WITHOUT-CHECKING |
| 2026-06-19 | False claim | FAIL-HEDGED-NOT-SOLD |
| 2026-06-19 | Visual | FAIL-VERSION-IN-META-NOT-ON-PAGE + FAIL-DEPLOY-ALIAS-SILENT |
| 2026-06-19 | Deploy/infra | FAIL-DEPLOY-VERCEL-SHIM-1DAY |
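The grouping by root cause can be recomputed from the table itself rather than taken on trust. A minimal sketch, assuming the log is available as the three-column markdown table above and that no cell contains a literal `|`; `tally_categories` is a hypothetical helper name.

```python
from collections import Counter


def tally_categories(markdown_table: str) -> Counter:
    """Count rows per Category in a 3-column markdown table."""
    counts: Counter = Counter()
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 3:
            continue  # not a table row
        date, category, _logged_as = cells
        # Skip the header row and the |---|---|---| separator.
        if date == "Date" or set(date) <= {"-"}:
            continue
        counts[category] += 1
    return counts


# First three rows of the log, copied from the table above.
sample = """
| Date | Category | Logged as |
|---|---|---|
| 2026-06-15 | Deploy/infra | FAIL-HYPERADAR-V2-OAUTH-STARTTIME-401 |
| 2026-06-15 | Assumed | FAIL-CREDENTIAL-ASSUMED-MISSING |
| 2026-06-10 | Deploy/infra | FAIL-PASTE-DEPLOY |
"""
counts = tally_categories(sample)
print(counts.most_common())
```

Run against the full table, the same function gives the per-category totals for all 50 entries.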
05 — The voices
This is not a local quirk: the field is actively working on exactly this. Below are the people thinking hardest about context, hallucination and agent reliability, and where they land. Handles verified; each take links to its source.
Context engineering
Andrej Karpathy
"Context engineering: filling the context window with just the right information for the next step." ↗
Tobi Lütke
Prefers "context engineering" over "prompt engineering" — it names the core skill better. ↗
Simon Willison
The term sticks because its meaning matches the practice — unlike "prompt engineering." ↗
Anthropic
Context is "a critical but finite resource" — agents fail from bloated or rotted context, not weak models. ↗
Why models assert false things
Adam Tauman Kalai et al.
Models hallucinate because training and benchmarks reward confident guessing over honest "I don't know." ↗
Agent reliability, evals & verification
Walden Yan
Reliability is a context-engineering problem; multi-agent systems break when sub-agents lose shared context. ↗
Harrison Chase
Context engineering = dynamic systems giving the LLM the right info, tools and format to finish the task. ↗
Hamel Husain
No reliable AI without evals: look at your data, build task-specific tests, measure judge-vs-human agreement. ↗
06 — The actions
A rule you can ignore is not a control. The fix runs at three levels.
The pre-send gate
Two passes before any message — attack the draft, then truth-check every claim against this session's evidence. No "done/verified/looks good" without the artifact in the same message; the check shows its receipt.
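The "no claim without the artifact" half of the gate is mechanical enough to sketch. The example below assumes a draft carries an explicit list of evidence items (file paths, command output); the `Draft` type, the `pre_send_gate` function, and the phrase list are all hypothetical. It checks only that evidence is attached, not that the evidence is true, so it does not replace the truth-check pass.

```python
import re
from dataclasses import dataclass, field

# Phrases treated as completion claims (illustrative, not exhaustive).
COMPLETION_CLAIM = re.compile(
    r"\b(done|verified|looks (?:good|great))\b", re.IGNORECASE
)


@dataclass
class Draft:
    text: str
    evidence: list[str] = field(default_factory=list)


def pre_send_gate(draft: Draft) -> tuple[bool, str]:
    """Block any draft that claims completion without attached evidence."""
    claims = sorted({c.lower() for c in COMPLETION_CLAIM.findall(draft.text)})
    if claims and not draft.evidence:
        return False, f"blocked: {claims} claimed with no evidence attached"
    return True, "ok"


print(pre_send_gate(Draft("The video looks great, done.")))
print(pre_send_gate(Draft("Done.", evidence=["out/final.mp4: 42 MB, plays"])))
```

The evidence string in the second call is made-up sample data; in practice it would be the receipt produced by the check itself.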
A pipeline that stops
Render every screen fully, then stop for human sign-off before any video is built; then a mechanical watch-gate; then a final human sign-off before ship.
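One way to make the stops structural is to have each stage raise unless its gate passes, so later stages cannot run on unapproved input. This is a sketch under assumptions: the stage callables, the `approve` callback standing in for the human, and the idea that the watch-gate returns a list of defects found in the built video are all hypothetical.

```python
from typing import Callable


class GateRejected(Exception):
    """Raised when a gate refuses to let the pipeline continue."""


def run_pipeline(
    render_screens: Callable[[], list[str]],
    build_video: Callable[[list[str]], str],
    watch_gate: Callable[[str], list[str]],
    approve: Callable[[str, object], bool],
) -> str:
    screens = render_screens()
    # Gate 1: a human signs off on the rendered screens.
    if not approve("screens", screens):
        raise GateRejected("screens not approved")

    video = build_video(screens)
    # Gate 2: mechanical check; any defect stops the run.
    defects = watch_gate(video)
    if defects:
        raise GateRejected(f"watch-gate found defects: {defects}")

    # Gate 3: final human sign-off before ship.
    if not approve("final", video):
        raise GateRejected("final video not approved")
    return video
```

Because rejection is an exception rather than a warning, there is no code path on which a video gets built from screens nobody approved.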
Deterministic QC
A checker that measures defects against the render: clipping, overlap, alignment, blank-canvas, missing audio, black frames. The model never "sees" the bug — the code proves it.
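Clipping and overlap reduce to rectangle arithmetic once each element's bounding box is known. The sketch below assumes the render step can report boxes in canvas pixels; the `Box` type, `find_defects`, and the sample layout are hypothetical, and the other defect classes (audio, black frames) are not covered.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    name: str
    x: int
    y: int
    w: int
    h: int


def find_defects(boxes: list[Box], canvas_w: int, canvas_h: int) -> list[str]:
    """Report elements that leave the canvas or intersect each other."""
    defects = []
    for b in boxes:
        if b.x < 0 or b.y < 0 or b.x + b.w > canvas_w or b.y + b.h > canvas_h:
            defects.append(f"clipped: {b.name}")
    for a, b in combinations(boxes, 2):
        overlap_w = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
        overlap_h = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
        # Boxes that merely touch along an edge do not count as overlapping.
        if overlap_w > 0 and overlap_h > 0:
            defects.append(f"overlap: {a.name} / {b.name}")
    return defects


layout = [
    Box("title", 100, 50, 800, 120),
    Box("subtitle", 100, 150, 800, 80),
    Box("logo", 1800, 980, 200, 100),
]
print(find_defects(layout, canvas_w=1920, canvas_h=1080))
# → ['clipped: logo', 'overlap: title / subtitle']
```

The verdict is a list of named, measured defects, which is what lets the check show its receipt instead of an opinion.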
Regression suite — every past failure, re-checked every build
Each logged defect becomes a locked test case. The checker's verdict on the current build: