Reliability log · v2

Solving the Claude reliability problem.

An AI agent that confidently reports work it never verified is worse than useless — it makes you the quality-control layer. This is the project record: the goal, the real failure log, the research it rests on, the voices that review the work, and the rules, gates and code that fixed it.

Failures logged

50

Every one written down, dated, root-caused. Source: failures.md.

Window

2026-05-24 → 2026-06-19

~4 weeks of building in the open.

Root-cause categories

7

Derived from the log, not guessed.

Current build

PASS

Green by the deterministic checker — proven, not asserted.

01 — The goal

Have the agent produce a finished, on-brand deliverable — and be able to rely on its "done." Not "done" as a feeling. "Done" as a claim backed by evidence anyone can audit.

The failure we set out to kill

Outcome ≠ transcript

The model's report of what it did is a separate act from the doing — and nothing in the base loop forces them to agree. A flight agent says "booked"; the only truth is whether a row exists in the database. Ours: the agent says "the video looks great" — the only truth is what plays on screen.
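A minimal sketch of the flight-agent case, assuming a hypothetical `bookings` table in SQLite. The table, the column name and the `verify_booking` helper are illustrative and not part of this project; the point is only that the verdict comes from the database, and the agent's wording is never consulted.

```python
import sqlite3

def verify_booking(conn: sqlite3.Connection, booking_ref: str) -> bool:
    """Return True only if a row for booking_ref actually exists."""
    row = conn.execute(
        "SELECT 1 FROM bookings WHERE ref = ?", (booking_ref,)
    ).fetchone()
    return row is not None

# Hypothetical world state: an in-memory database holding one real booking.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (ref TEXT PRIMARY KEY)")
conn.execute("INSERT INTO bookings VALUES ('ABC123')")

transcript_claim = "booked"  # what the agent said; deliberately unused below

booked_real = verify_booking(conn, "ABC123")     # row exists
booked_phantom = verify_booking(conn, "XYZ999")  # claimed, but no row
print(booked_real, booked_phantom)
```

The same shape applies to the video case: swap the database query for a probe of the rendered file.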

02 — The thinking

The pattern in the log was not random incompetence. It was one shape repeating: the agent narrated the intended result of an action as if it had verified it. Sign-off from intent, not inspection. So the fix could not be "try harder" — a discipline you can skip is not a control. It had to be structural: make the cheap honest path (open the file, run the check) the only path to "done," and put a human at the two gates where judgement was being faked.

The point isn't that the agent never errs. It's that an error can no longer reach you dressed as "done."

03 — The research

This is the central, documented limitation of agentic systems — not a local quirk. The fixes below are drawn from the literature, then encoded as gates.

Self-correction without an external signal is unreliable and can make output worse (Kamoi et al., TACL 2024; Huang et al., ICLR 2024). Training on human feedback biases models toward answers people want to hear, which in practice means confident, finished-sounding ones: sycophancy (Perez et al., 2022). And vision-language models are weakest at exactly the low-level defects that sink a deliverable: clipping, alignment, size, contrast. So: verify with code against the artifact; never trust the model's report of itself.

The mechanisms, tagged by how much they can be trusted

Mechanism	Kills	Trust
Artifact-exists + media-integrity probe	false completion	robust
Required-content string check (OCR)	"text's all there"	robust
Layout-defect measurement (clip, overlap, blank, hex)	visual-verify gap	robust
Cross-model vision pass (different model)	self-preference	partial
Per-criterion rubric, position-shuffled	unreliable self-eval	partial
Human approval interrupt	subjective taste	robust
Generator declaring "looks good" from intent	nothing	theatre
Same-model self-critique / debate	nothing	theatre
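The first two rows of the table can be sketched as follows. This is an illustration under stated assumptions, not the project's checker: `artifact_exists` and `missing_strings` are hypothetical helpers, the media-integrity probe is reduced to a non-empty-file test, and the text is assumed to have already been extracted (by OCR or any other extractor) before the string check runs.

```python
import tempfile
from pathlib import Path

def artifact_exists(path: Path, min_bytes: int = 1) -> bool:
    """False-completion check: the file must exist and not be empty."""
    return path.is_file() and path.stat().st_size >= min_bytes

def missing_strings(extracted_text: str, required: list[str]) -> list[str]:
    """Required-content check: return every required string not found.

    Whitespace and case are normalised so line wraps in the extracted
    text do not cause false alarms.
    """
    def norm(s: str) -> str:
        return " ".join(s.split()).lower()

    haystack = norm(extracted_text)
    return [s for s in required if norm(s) not in haystack]

with tempfile.TemporaryDirectory() as tmp:
    deliverable = Path(tmp) / "deck.html"    # hypothetical artifact
    never_written = Path(tmp) / "video.mp4"  # claimed, but absent
    deliverable.write_text("<h1>Quarterly\n review</h1><p>Price: 40</p>")

    exists_ok = artifact_exists(deliverable)
    exists_missing = artifact_exists(never_written)
    gaps = missing_strings(
        deliverable.read_text(), ["Quarterly review", "Contact us"]
    )

print(exists_ok, exists_missing, gaps)
```

Both checks return data rather than prose, so a gate can act on the result without interpreting a model's description of it.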

04 — The failures

50 logged over ~4 weeks (2026-05-24 to 2026-06-19). Grouped by root cause, straight from the log:

Category	Count	What it covers
Deploy/infra	10	The build & ship chain
False claim	9	Asserted without checking
Assumed	8	Never probed
Process	8	Comms & discipline
Other	8	Misc
Destructive	4	Data-loss risk
Visual	3	Signed off from intent, not pixels

The full log — 50 entries, nothing hidden

Date	Category	Logged as
2026-06-15	Deploy/infra	FAIL-HYPERADAR-V2-OAUTH-STARTTIME-401
2026-06-15	Assumed	FAIL-CREDENTIAL-ASSUMED-MISSING
2026-06-10	Deploy/infra	FAIL-PASTE-DEPLOY
2026-06-08	Assumed	FAIL-SSH-DENIED-FROM-STALE-README
2026-06-08	Destructive	FAIL-BLEED-DECLARED-STOPPED-BEFORE-KILLING-INFLIGHT
2026-06-05	Assumed	FAIL-MARY-RETIRED-MISREAD
2026-06-04	Assumed	FAIL-SCOTTY-BLOCKED-FROM-MISREAD-NOT-PROBE
2026-06-03	Process	FAIL-ASSLICKING-REGISTER
2026-06-03	Visual	FAIL-DOP-CARD-FILLED-FROM-INTENT-NOT-LOOKING
2026-06-03	False claim	FAIL-BACKUP-STRATEGY-IS-FICTION
2026-06-03	False claim	FAIL-MINI-GAP-FRAMED-AS-NEVER-WORKED
2026-06-02	Process	FAIL-ASKUSERQUESTION-DIED-WENT-SILENT
2026-06-02	Deploy/infra	FAIL-DEPLOY-HANDED-PASTE-BLOCK
2026-06-02	Assumed	FAIL-ALLOWLIST-UNNECESSARY-ASK
2026-06-01	False claim	FAIL-UNREACHABLE-ASSERTION
2026-05-24	Deploy/infra	FAIL-DEPLOY-TAILSCALE-SILENT-HANG
2026-06-01	Assumed	FAIL-CANON-BLIND-MEETINGS
2026-06-01	False claim	FAIL-PHANTOM-URL-BOOKING-LINKS
2026-06-01	Deploy/infra	FAIL-CANONICAL-ALIAS-DRIFT
2026-06-01	Deploy/infra	FAIL-LOGODEV-TOKEN-DEAD-404
2026-06-02	Assumed	FAIL-STALE-CANON-OVER-MIGRATION-LOG
2026-06-02	Process	FAIL-RULE11-SILENT-ON-TOOL-ERROR
2026-06-02	Assumed	FAIL-DISKFULL-FALSE-CANT-RECLAIM
2026-06-02	Other	FAIL-BOOTSTRAP-SHARED-LOG-EACCES
2026-06-02	Other	FAIL-LAUNCHD-TCC-VOLUMES-UNREADABLE
2026-06-03	Other	FAIL-DRUMBEAT-MISSED-CYBER
2026-06-03	False claim	FAIL-STUDIOSERVER-STALE-HEADLINE-PANIC
2026-06-03	Deploy/infra	FAIL-DEPLOY-EQUALSFIVE-NO-CRED-RAIL
2026-06-04	False claim	FAIL-SESSIONS-DISKFULL-IS-A-PHANTOM-BLOCKER
2026-06-04	Destructive	FAIL-FIRED-RSYNC-BLIND-NO-PREFLIGHT
2026-06-05	Other	FAIL-MEDIAROUNDUP-WRONG-PROPOSAL
2026-06-05	Process	FAIL-REVIEWABLE-SURFACE-AS-FILE-CARD
2026-06-05	False claim	FAIL-STUDIOSERVER-STATUS-OMITTED-MIRROR
2026-06-05	Visual	FAIL-IMAGE-SIGNED-OFF-FROM-INTENT-x5
2026-06-08	Destructive	FAIL-PLEX-MUSIC-WIPE
2026-06-08	Destructive	FAIL-MIRROR-WIPE
2026-06-09	Deploy/infra	FAIL-PASTE-DEPLOY-VERCEL-PAT-MISSING
2026-06-10	Process	FAIL-30SEC-VERBOSE-INVOICE-SESSION
2026-06-10	Process	FAIL-SILENT-SUBAGENT-RUN
2026-06-10	Process	FAIL-VERBOSE-x2
2026-06-10	Deploy/infra	FAIL-CWDEPLOY-SANDBOX-PATH+CEILING
2026-06-10	Process	FAIL-INTERNAL-EXTERNAL-SURFACE-MIX
2026-06-14	Other	FAIL-DAISY-UNKNOWN
2026-06-15	Other	FAIL-HYPERADAR-API-TIMELINE-LOSSY
2026-06-15	Other	FAIL-RELATIVE-ASSET-LINK-LABS
2026-06-19	Other	FAIL-HYPERADAR-FORYOU-SLICE-AND-TOOLS-ONLY-FILTER
2026-06-19	False claim	FAIL-HYPERADAR-6H-SILENTLY-DEAD-SINCE-0615 + FAIL-CLAIMED-NO-FEED-WITHOUT-CHECKING
2026-06-19	False claim	FAIL-HEDGED-NOT-SOLD
2026-06-19	Visual	FAIL-VERSION-IN-META-NOT-ON-PAGE + FAIL-DEPLOY-ALIAS-SILENT
2026-06-19	Deploy/infra	FAIL-DEPLOY-VERCEL-SHIM-1DAY
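The per-category counts can be re-derived from the table rather than taken on trust. The sketch below assumes the rows are available as tab-separated text in the layout shown above (the actual format of failures.md is not given here), and `count_categories` is a hypothetical helper. Only the first five rows are used as sample input.

```python
from collections import Counter

# First five rows of the log, tab-separated: date, category, identifier.
SAMPLE_ROWS = """\
2026-06-15\tDeploy/infra\tFAIL-HYPERADAR-V2-OAUTH-STARTTIME-401
2026-06-15\tAssumed\tFAIL-CREDENTIAL-ASSUMED-MISSING
2026-06-10\tDeploy/infra\tFAIL-PASTE-DEPLOY
2026-06-08\tAssumed\tFAIL-SSH-DENIED-FROM-STALE-README
2026-06-08\tDestructive\tFAIL-BLEED-DECLARED-STOPPED-BEFORE-KILLING-INFLIGHT
"""

def count_categories(rows: str) -> Counter:
    """Tally the category column, skipping blank lines."""
    counts = Counter()
    for line in rows.splitlines():
        if not line.strip():
            continue
        _date, category, _identifier = line.split("\t", 2)
        counts[category] += 1
    return counts

counts = count_categories(SAMPLE_ROWS)
for category, n in counts.most_common():
    print(f"{category} · {n}")
```

Run over all 50 rows, the tallies should match the grouped counts given at the top of this section; a mismatch would itself be a logged failure.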

05 — The voices

Not a local quirk: the field is actively working on exactly this. Below are the people thinking hardest about context, hallucination and agent reliability, and where they land. Handles verified; each take links to its source.

Context engineering

@karpathy

Andrej Karpathy

"Context engineering: filling the context window with just the right information for the next step." ↗

@tobi

Tobi Lütke

Prefers "context engineering" over "prompt engineering" — it names the core skill better. ↗

@simonw

Simon Willison

The term sticks because its meaning matches the practice — unlike "prompt engineering." ↗

@AnthropicAI

Anthropic

Context is "a critical but finite resource" — agents fail from bloated or rotted context, not weak models. ↗

Why models assert false things

OpenAI · paper

Adam Tauman Kalai et al.

Models hallucinate because training and benchmarks reward confident guessing over honest "I don't know." ↗

Agent reliability, evals & verification

@walden_yan

Walden Yan

Reliability is a context-engineering problem; multi-agent systems break when sub-agents lose shared context. ↗

@hwchase17

Harrison Chase

Context engineering = dynamic systems giving the LLM the right info, tools and format to finish the task. ↗

@HamelHusain

Hamel Husain

No reliable AI without evals: look at your data, build task-specific tests, measure judge-vs-human agreement. ↗

@sh_reya

Shreya Shankar

LLM judges drift from human judgment; evaluation criteria must be iteratively aligned to humans. ↗

06 — The actions

A rule you can ignore is not a control. The fix runs at three levels.

Layer 1 — Rules

The pre-send gate

Two passes before any message — attack the draft, then truth-check every claim against this session's evidence. No "done/verified/looks good" without the artifact in the same message; the check shows its receipt.
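One way the second half of this rule could be made mechanical is sketched below. The `Draft` type, the word list and `pre_send_gate` are assumptions for illustration; the rule itself does not prescribe an implementation. A draft that uses completion language without a receipt attached is rejected before it is sent.

```python
import re
from dataclasses import dataclass, field

# Completion language taken from the rule above; extend as needed.
COMPLETION_WORDS = re.compile(r"\b(done|verified|looks good)\b", re.IGNORECASE)

@dataclass
class Draft:
    text: str
    # Evidence carried in the same message, e.g. checker output or a file hash.
    receipts: list[str] = field(default_factory=list)

def pre_send_gate(draft: Draft) -> tuple[bool, str]:
    """Block any completion claim that carries no receipt."""
    if COMPLETION_WORDS.search(draft.text) and not draft.receipts:
        return False, "completion claim without a receipt in the same message"
    return True, "ok"

blocked = pre_send_gate(Draft("The video looks good."))
allowed = pre_send_gate(Draft("Done.", receipts=["EXIT=0 FAILS: NONE"]))
neutral = pre_send_gate(Draft("Still rendering slide 4."))
print(blocked, allowed, neutral)
```

A keyword match is deliberately crude: it cannot judge whether a receipt is genuine, only that one is present, so it complements the truth-check pass and does not replace it.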

Layer 2 — Gates

A pipeline that stops

Render every screen fully → human sign-off before any video is built → a mechanical watch-gate → a final human sign-off before ship.
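The stopping behaviour can be sketched as a function that raises at the first refused gate, so later stages never run. All names here (`run_pipeline`, `GateStopped`, the four callables) are hypothetical; in a real pipeline `human_approves` would block on a person, not return immediately.

```python
from typing import Callable

class GateStopped(Exception):
    """Raised when a gate refuses to let the pipeline continue."""

def run_pipeline(
    render_screens: Callable[[], list[str]],
    human_approves: Callable[[str, list[str]], bool],
    build_video: Callable[[list[str]], str],
    watch_gate: Callable[[str], bool],
) -> str:
    screens = render_screens()
    # Gate 1: a human signs off the screens before any video is built.
    if not human_approves("screens", screens):
        raise GateStopped("screens not signed off; video not built")
    video = build_video(screens)
    # Gate 2: a mechanical check on the built video.
    if not watch_gate(video):
        raise GateStopped("mechanical watch-gate failed")
    # Gate 3: a final human sign-off before shipping.
    if not human_approves("final", [video]):
        raise GateStopped("final sign-off withheld; not shipped")
    return video

# Usage with stand-in callables: the final sign-off is withheld.
approvals = {"screens": True, "final": False}
try:
    run_pipeline(
        render_screens=lambda: ["s1.png", "s2.png"],
        human_approves=lambda stage, items: approvals[stage],
        build_video=lambda screens: "out.mp4",
        watch_gate=lambda video: True,
    )
except GateStopped as stop:
    print(f"stopped: {stop}")
```

Raising an exception instead of returning a status keeps a refused gate from being silently ignored by the caller.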

Layer 3 — Code

Deterministic QC

A checker that measures defects against the render: clipping, overlap, alignment, blank-canvas, missing audio, black frames. The model never "sees" the bug — the code proves it.
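Two of the listed defects, clipping and overlap, reduce to rectangle arithmetic once element bounding boxes are known. The sketch below assumes the boxes have already been measured from the render (how that is done is outside this example); `Box`, `clipped`, `overlaps` and `check_layout` are hypothetical names and not the project's qc.py.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    name: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height

def clipped(box: Box, canvas_w: float, canvas_h: float) -> bool:
    """True if any part of the box falls outside the canvas."""
    return (
        box.x < 0
        or box.y < 0
        or box.x + box.w > canvas_w
        or box.y + box.h > canvas_h
    )

def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any interior area."""
    return (
        a.x < b.x + b.w and b.x < a.x + a.w
        and a.y < b.y + b.h and b.y < a.y + a.h
    )

def check_layout(boxes: list[Box], canvas_w: float, canvas_h: float) -> list[str]:
    """Return one line per defect; an empty list means the layout passes."""
    defects = [f"clip: {b.name}" for b in boxes if clipped(b, canvas_w, canvas_h)]
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if overlaps(a, b):
                defects.append(f"overlap: {a.name} / {b.name}")
    return defects

# Hypothetical slide on a 120x100 canvas: the footer runs off the right edge
# and sits on top of the chart.
slide = [
    Box("title", 0, 0, 100, 20),
    Box("chart", 0, 30, 100, 50),
    Box("footer", 50, 70, 100, 20),
]
defects = check_layout(slide, canvas_w=120, canvas_h=100)
print(defects)
```

Because the result is a list of named defects, the verdict can be logged and diffed between builds, which a free-text "looks fine" cannot.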

Regression suite — every past failure, re-checked every build

Each logged defect becomes a locked test case. The checker's verdict on the current build:

$ qc.py deck  vbb-deck-v3.html  s1..s8

  s4  chart        spread=210   ok
  s5  value cols    2 columns    ok
  s6  value cols    1 column     ok
  s1  clip/overlap  none         ok
  s8  clip/overlap  none         ok

  EXIT=0   FAILS: NONE — all 8 slides pass
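A locked test case can be as small as one function registered under the identifier of the failure it guards against. The sketch below is an assumption about shape only: the registry, the `locks` decorator, the `build` dictionary and the body of the check are hypothetical, and only the failure identifier is taken from the log.

```python
from typing import Callable

# Maps a logged failure identifier to the check that guards against it.
REGRESSIONS: dict[str, Callable[[dict], bool]] = {}

def locks(failure_id: str):
    """Register a check under the identifier of the failure it locks."""
    def register(check: Callable[[dict], bool]) -> Callable[[dict], bool]:
        REGRESSIONS[failure_id] = check
        return check
    return register

@locks("FAIL-VERSION-IN-META-NOT-ON-PAGE")
def version_is_visible(build: dict) -> bool:
    # Assumed meaning of the failure: the version string must appear in the
    # text a reader can see, not only in page metadata.
    return build["version"] in build["visible_text"]

def run_suite(build: dict) -> int:
    """Run every locked check; return 0 if all pass, 1 otherwise."""
    failed = [fid for fid, check in REGRESSIONS.items() if not check(build)]
    for fid in failed:
        print(f"FAIL  {fid}")
    exit_code = 1 if failed else 0
    print(f"EXIT={exit_code}")
    return exit_code

# Hypothetical build description.
exit_code = run_suite({"version": "v2", "visible_text": "Reliability log · v2"})
```

Keying checks by failure identifier makes the link between the log and the suite auditable: every entry either has a check or visibly lacks one.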

Articulate · reliability log v2 · spun out of the GIG insurance video project · generated from failures.md (50 entries) on 2026-06-19. The page is itself the artifact. · v1 archived