Solving the Claude reliability problem.
An AI agent that confidently reports work it never verified is worse than useless — it makes you the quality-control layer. This is the record of how we engineered that out: the goal, the failures by category, and the rules, gates and code that fixed them.
The goal
The brief was simple to state and hard to deliver: have the agent produce a finished, on-brand deliverable — and be able to rely on its "done." Not "done" as a feeling. "Done" as a claim backed by evidence anyone can audit.
Outcome ≠ transcript
The model's report of what it did is a separate act from the doing — and nothing in the base loop forces them to agree. A flight agent says "booked"; the only truth is whether a row exists in the database. Our equivalent: the agent says "the video looks great" — the only truth is what plays on screen.
The failures
We are not alone in this — it is the central, documented limitation of agentic systems. Naming the category is what makes a fix possible.
The incidents — logged, not hidden
| # | Category | What happened |
|---|---|---|
| 1 | D · visual | A data bar fell off the bottom of a slide — the most important figure, clipped out of frame. Signed off from a thumbnail. |
| 2 | D · visual | Chart values floated ragged at each bar's end instead of a clean aligned column. |
| 3 | A · false completion | A chart never rendered; the slide shipped blank where the data should be. |
| 4 | D · visual | The presenter was a hard rectangular box, not a clean cut-out — the amateur tell. |
| 5 | A · false completion | "Music added" — but the bed was buried below audible and effectively absent. |
| 6 | D · visual | The presenter sat on top of the caption text. Wrong corner, no safe-zone. |
| 7 | B+C · self-eval | Frame-grabs passed off as "I watched it." Two stills generalised into a whole-film claim that wasn't true. |
The actions
A rule you can ignore is not a control. So the fix runs at three levels — a standing rule, enforced gates, and deterministic code that computes the defect instead of asking a model to notice it.
The pre-send gate
Before any message: two adversarial passes — attack the draft, then truth-check every claim against this session's evidence. Banned without the artifact in the same message: verified, looks good, done, it works. The check shows its receipt, so its absence is catchable.
A pipeline that stops
Render every screen fully → human sign-off before any video is built; a mechanical watch-gate; and a final human sign-off before ship. The two places discretion was abused are now owned by the human.
Deterministic QC
A checker that measures defects against the render: edge-clipping, element overlap, value alignment, blank-canvas, missing audio, black frames. The model is never asked to "see" the bug — the code proves it.
Regression suite — every past failure, re-checked every build
Each defect above is now a locked test case that must pass before anything ships. The checker's verdict on the current build:
Earlier the same checker returned EXIT=1 with real findings — proof it bites, not a rubber stamp.
The point isn't that the agent never errs. It's that an error can no longer reach you dressed as "done."