Reliability log · v2

Solving the Claude reliability problem.

An AI agent that confidently reports work it never verified is worse than useless — it makes you the quality-control layer. This is the project record: the goal, the real failure log, the research it rests on, the voices that review the work, and the rules, gates and code that fixed it.

Failures logged: 50. Every one written down, dated, root-caused. Source: failures.md.
Window: 2026-05-24 → 2026-06-19. ~4 weeks of building in the open.
Root-cause categories: 7. Derived from the log, not guessed.
Current build: PASS. Green by the deterministic checker: proven, not asserted.

01 — The goal

Have the agent produce a finished, on-brand deliverable — and be able to rely on its "done." Not "done" as a feeling. "Done" as a claim backed by evidence anyone can audit.

The failure we set out to kill

Outcome ≠ transcript

The model's report of what it did is a separate act from the doing — and nothing in the base loop forces them to agree. A flight agent says "booked"; the only truth is whether a row exists in the database. Ours: the agent says "the video looks great" — the only truth is what plays on screen.
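A minimal sketch of the flight-agent case, assuming a hypothetical `bookings` table in SQLite. The schema, the IDs and the `reconcile` helper are illustrative, not part of the project's code. The point is only that the verdict comes from the database, and the agent's sentence is treated as a claim to be checked.

```python
import sqlite3


def booking_exists(conn: sqlite3.Connection, booking_id: str) -> bool:
    """Return True only if the booking row is actually in the database."""
    row = conn.execute(
        "SELECT 1 FROM bookings WHERE id = ?", (booking_id,)
    ).fetchone()
    return row is not None


def reconcile(agent_report: str, conn: sqlite3.Connection, booking_id: str) -> dict:
    """Compare what the agent said with what the database shows."""
    # Naive claim detection, good enough for the demo sentences below.
    claimed = "booked" in agent_report.lower()
    actual = booking_exists(conn, booking_id)
    return {"claimed": claimed, "actual": actual, "agree": claimed == actual}


# Demo against an in-memory database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id TEXT PRIMARY KEY)")

before = reconcile("Booked your flight!", conn, "BK-1")  # claim, no row yet
conn.execute("INSERT INTO bookings (id) VALUES (?)", ("BK-1",))
after = reconcile("Booked your flight!", conn, "BK-1")   # claim, row exists

print(before)
print(after)
```

The same transcript produces two different verdicts, because only the state of the table changed.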

02 — The thinking

The pattern in the log was not random incompetence. It was one shape repeating: the agent narrated the intended result of an action as if it had verified it. Sign-off from intent, not inspection. So the fix could not be "try harder" — a discipline you can skip is not a control. It had to be structural: make the cheap honest path (open the file, run the check) the only path to "done," and put a human at the two gates where judgement was being faked.

The point isn't that the agent never errs. It's that an error can no longer reach you dressed as "done."

03 — The research

This is the central, documented limitation of agentic systems — not a local quirk. The fixes below are drawn from the literature, then encoded as gates.

Self-correction without an external signal is unreliable and can make output worse (Kamoi et al., TACL 2024; Huang et al., ICLR 2024). Training also biases models toward telling the user what they want to hear, which in practice means confident, finished-sounding answers: sycophancy (Perez et al., 2022). And vision-language models are weakest at exactly the low-level defects that sink a deliverable: clipping, alignment, size, contrast. So: verify with code against the artifact; never trust the model's report of itself.

The mechanisms, tagged by how much they can be trusted

| Mechanism | Kills | Trust |
| --- | --- | --- |
| Artifact-exists + media-integrity probe | false completion | robust |
| Required-content string check (OCR) | "text's all there" | robust |
| Layout-defect measurement (clip, overlap, blank, hex) | visual-verify gap | robust |
| Cross-model vision pass (different model) | self-preference | partial |
| Per-criterion rubric, position-shuffled | unreliable self-eval | partial |
| Human approval interrupt | subjective taste | robust |
| Generator declaring "looks good" from intent | nothing | theatre |
| Same-model self-critique / debate | nothing | theatre |
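The first two "robust" rows can be sketched with the standard library alone. This is an illustration under assumptions, not the project's checker: `probe_artifact`, `missing_strings` and the `MAGIC` table are hypothetical names, the header check covers only three formats, and the text is assumed to have been extracted already by whatever OCR step runs upstream. A real media-integrity probe would likely hand video and audio files to a dedicated tool such as ffprobe.

```python
import tempfile
from pathlib import Path

# Leading bytes for a few file formats (a small illustrative subset).
MAGIC = {
    ".png": b"\x89PNG\r\n\x1a\n",
    ".pdf": b"%PDF-",
    ".gif": b"GIF8",
}


def probe_artifact(path, min_bytes: int = 1) -> tuple[bool, str]:
    """Check that the file exists, is not empty, and starts like its extension says."""
    p = Path(path)
    if not p.is_file():
        return False, "missing"
    size = p.stat().st_size
    if size < min_bytes:
        return False, f"too small ({size} bytes)"
    magic = MAGIC.get(p.suffix.lower())
    if magic is not None:
        with p.open("rb") as fh:
            if fh.read(len(magic)) != magic:
                return False, "header does not match extension"
    return True, "ok"


def missing_strings(extracted_text: str, required: list[str]) -> list[str]:
    """Return the required strings absent from the text (case/whitespace-insensitive)."""
    haystack = " ".join(extracted_text.lower().split())
    return [s for s in required if " ".join(s.lower().split()) not in haystack]


# Demo: one valid file, one impostor, one that was never written.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "slide.png"
    good.write_bytes(MAGIC[".png"] + b"\x00" * 16)
    fake = Path(tmp) / "fake.png"
    fake.write_bytes(b"not really a png")
    results = {
        "good": probe_artifact(good),
        "fake": probe_artifact(fake),
        "absent": probe_artifact(Path(tmp) / "never-written.png"),
    }

ocr_text = "Quarterly  results\nRevenue up 12%"  # stand-in for OCR output
gaps = missing_strings(ocr_text, ["quarterly results", "Contact us"])
print(results, gaps)
```

Both functions return evidence (a reason string, a list of missing items) rather than a bare pass or fail, so the result can be pasted into a message as a receipt.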

04 — The failures

50 logged over ~4 weeks (2026-05-24 → 2026-06-19). Grouped by root cause, straight from the log:

Deploy/infra · 10: The build & ship chain
False claim · 9: Asserted without checking
Assumed · 8: Never probed
Process · 8: Comms & discipline
Other · 8: Misc
Destructive · 4: Data-loss risk
Visual · 3: Signed off from intent, not pixels

The full log — 50 entries, nothing hidden

| Date | Category | Logged as |
| --- | --- | --- |
| 2026-06-15 | Deploy/infra | FAIL-HYPERADAR-V2-OAUTH-STARTTIME-401 |
| 2026-06-15 | Assumed | FAIL-CREDENTIAL-ASSUMED-MISSING |
| 2026-06-10 | Deploy/infra | FAIL-PASTE-DEPLOY |
| 2026-06-08 | Assumed | FAIL-SSH-DENIED-FROM-STALE-README |
| 2026-06-08 | Destructive | FAIL-BLEED-DECLARED-STOPPED-BEFORE-KILLING-INFLIGHT |
| 2026-06-05 | Assumed | FAIL-MARY-RETIRED-MISREAD |
| 2026-06-04 | Assumed | FAIL-SCOTTY-BLOCKED-FROM-MISREAD-NOT-PROBE |
| 2026-06-03 | Process | FAIL-ASSLICKING-REGISTER |
| 2026-06-03 | Visual | FAIL-DOP-CARD-FILLED-FROM-INTENT-NOT-LOOKING |
| 2026-06-03 | False claim | FAIL-BACKUP-STRATEGY-IS-FICTION |
| 2026-06-03 | False claim | FAIL-MINI-GAP-FRAMED-AS-NEVER-WORKED |
| 2026-06-02 | Process | FAIL-ASKUSERQUESTION-DIED-WENT-SILENT |
| 2026-06-02 | Deploy/infra | FAIL-DEPLOY-HANDED-PASTE-BLOCK |
| 2026-06-02 | Assumed | FAIL-ALLOWLIST-UNNECESSARY-ASK |
| 2026-06-01 | False claim | FAIL-UNREACHABLE-ASSERTION |
| 2026-05-24 | Deploy/infra | FAIL-DEPLOY-TAILSCALE-SILENT-HANG |
| 2026-06-01 | Assumed | FAIL-CANON-BLIND-MEETINGS |
| 2026-06-01 | False claim | FAIL-PHANTOM-URL-BOOKING-LINKS |
| 2026-06-01 | Deploy/infra | FAIL-CANONICAL-ALIAS-DRIFT |
| 2026-06-01 | Deploy/infra | FAIL-LOGODEV-TOKEN-DEAD-404 |
| 2026-06-02 | Assumed | FAIL-STALE-CANON-OVER-MIGRATION-LOG |
| 2026-06-02 | Process | FAIL-RULE11-SILENT-ON-TOOL-ERROR |
| 2026-06-02 | Assumed | FAIL-DISKFULL-FALSE-CANT-RECLAIM |
| 2026-06-02 | Other | FAIL-BOOTSTRAP-SHARED-LOG-EACCES |
| 2026-06-02 | Other | FAIL-LAUNCHD-TCC-VOLUMES-UNREADABLE |
| 2026-06-03 | Other | FAIL-DRUMBEAT-MISSED-CYBER |
| 2026-06-03 | False claim | FAIL-STUDIOSERVER-STALE-HEADLINE-PANIC |
| 2026-06-03 | Deploy/infra | FAIL-DEPLOY-EQUALSFIVE-NO-CRED-RAIL |
| 2026-06-04 | False claim | FAIL-SESSIONS-DISKFULL-IS-A-PHANTOM-BLOCKER |
| 2026-06-04 | Destructive | FAIL-FIRED-RSYNC-BLIND-NO-PREFLIGHT |
| 2026-06-05 | Other | FAIL-MEDIAROUNDUP-WRONG-PROPOSAL |
| 2026-06-05 | Process | FAIL-REVIEWABLE-SURFACE-AS-FILE-CARD |
| 2026-06-05 | False claim | FAIL-STUDIOSERVER-STATUS-OMITTED-MIRROR |
| 2026-06-05 | Visual | FAIL-IMAGE-SIGNED-OFF-FROM-INTENT-x5 |
| 2026-06-08 | Destructive | FAIL-PLEX-MUSIC-WIPE |
| 2026-06-08 | Destructive | FAIL-MIRROR-WIPE |
| 2026-06-09 | Deploy/infra | FAIL-PASTE-DEPLOY-VERCEL-PAT-MISSING |
| 2026-06-10 | Process | FAIL-30SEC-VERBOSE-INVOICE-SESSION |
| 2026-06-10 | Process | FAIL-SILENT-SUBAGENT-RUN |
| 2026-06-10 | Process | FAIL-VERBOSE-x2 |
| 2026-06-10 | Deploy/infra | FAIL-CWDEPLOY-SANDBOX-PATH+CEILING |
| 2026-06-10 | Process | FAIL-INTERNAL-EXTERNAL-SURFACE-MIX |
| 2026-06-14 | Other | FAIL-DAISY-UNKNOWN |
| 2026-06-15 | Other | FAIL-HYPERADAR-API-TIMELINE-LOSSY |
| 2026-06-15 | Other | FAIL-RELATIVE-ASSET-LINK-LABS |
| 2026-06-19 | Other | FAIL-HYPERADAR-FORYOU-SLICE-AND-TOOLS-ONLY-FILTER |
| 2026-06-19 | False claim | FAIL-HYPERADAR-6H-SILENTLY-DEAD-SINCE-0615 + FAIL-CLAIMED-NO-FEED-WITHOUT-CHECKING |
| 2026-06-19 | False claim | FAIL-HEDGED-NOT-SOLD |
| 2026-06-19 | Visual | FAIL-VERSION-IN-META-NOT-ON-PAGE + FAIL-DEPLOY-ALIAS-SILENT |
| 2026-06-19 | Deploy/infra | FAIL-DEPLOY-VERCEL-SHIM-1DAY |
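The category counts and the date window can be re-derived from the log rather than typed by hand. A sketch, assuming each entry has already been parsed into a `(date, category, id)` tuple. The real layout of failures.md is not shown on this page, so the parsing step is left out and only five of the 50 rows are used as sample data.

```python
from collections import Counter

# Five rows copied from the log above; the full list would hold all 50.
rows = [
    ("2026-06-08", "Destructive", "FAIL-PLEX-MUSIC-WIPE"),
    ("2026-06-08", "Destructive", "FAIL-MIRROR-WIPE"),
    ("2026-06-05", "Visual", "FAIL-IMAGE-SIGNED-OFF-FROM-INTENT-x5"),
    ("2026-06-10", "Deploy/infra", "FAIL-PASTE-DEPLOY"),
    ("2026-05-24", "Deploy/infra", "FAIL-DEPLOY-TAILSCALE-SILENT-HANG"),
]

# Tally by root-cause category.
counts = Counter(category for _, category, _ in rows)

# ISO dates sort correctly as plain strings, so min/max give the window.
dates = [date for date, _, _ in rows]
window = (min(dates), max(dates))

for category, n in counts.most_common():
    print(f"{category} · {n}")
print(f"{len(rows)} entries, {window[0]} → {window[1]}")
```

Generating the summary from the rows keeps the headline numbers and the table from drifting apart.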

05 — The voices

This is not a local quirk: the field is actively working on exactly this. Below are the people thinking hardest about context, hallucination and agent reliability, and where they land. Handles verified; takes link to source.

Context engineering

Andrej Karpathy

"Context engineering: filling the context window with just the right information for the next step."

Tobi Lütke

Prefers "context engineering" over "prompt engineering" — it names the core skill better.

Simon Willison

The term sticks because its meaning matches the practice — unlike "prompt engineering."

Anthropic

Context is "a critical but finite resource" — agents fail from bloated or rotted context, not weak models.

Why models assert false things

Adam Tauman Kalai et al.

Models hallucinate because training and benchmarks reward confident guessing over honest "I don't know."

Agent reliability, evals & verification

Walden Yan

Reliability is a context-engineering problem; multi-agent systems break when sub-agents lose shared context.

Harrison Chase

Context engineering = dynamic systems giving the LLM the right info, tools and format to finish the task.

Hamel Husain

No reliable AI without evals: look at your data, build task-specific tests, measure judge-vs-human agreement.

Shreya Shankar

LLM judges drift from human judgment; evaluation criteria must be iteratively aligned to humans.

06 — The actions

A rule you can ignore is not a control. The fix runs at three levels.

Layer 1 — Rules

The pre-send gate

Two passes before any message — attack the draft, then truth-check every claim against this session's evidence. No "done/verified/looks good" without the artifact in the same message; the check shows its receipt.
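The second half of that rule is mechanical enough to sketch. The `Draft` type, the word list and `pre_send_gate` are hypothetical names, and a real gate would need a broader notion of what counts as a completion claim. The idea shown is only this: a message containing "done", "verified" or "looks good" is blocked unless a receipt travels with it.

```python
import re
from dataclasses import dataclass, field

# Completion words named in the rule above; extend as needed.
CLAIM_WORDS = re.compile(r"\b(done|verified|looks good)\b", re.IGNORECASE)


@dataclass
class Draft:
    text: str
    # Evidence attached to the same message: checker output, file paths, etc.
    receipts: list[str] = field(default_factory=list)


def pre_send_gate(draft: Draft) -> tuple[bool, str]:
    """Block any completion claim that arrives without evidence."""
    claims = sorted({c.lower() for c in CLAIM_WORDS.findall(draft.text)})
    if claims and not draft.receipts:
        return False, f"blocked: {claims} claimed with no receipt attached"
    return True, "ok to send"


print(pre_send_gate(Draft("All done, looks good.")))
print(pre_send_gate(Draft("Done.", receipts=["qc.py EXIT=0"])))
print(pre_send_gate(Draft("Still rendering slide 4.")))
```

The gate does not judge whether the receipt is any good. That is the job of the deterministic checks; this step only makes an unsupported "done" impossible to send.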

Layer 2 — Gates

A pipeline that stops

Render every screen fully → human sign-off before any video is built; a mechanical watch-gate; a final human sign-off before ship.
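One way to make a pipeline that stops, sketched under assumptions: gates run in order, each returns a plain boolean, and the first failure raises instead of continuing. `GateStopped`, `human_gate` and `run_pipeline` are illustrative names. The human gates default to a blocking `input()` prompt, replaced in the demo by scripted answers so the example runs unattended.

```python
from typing import Callable


class GateStopped(Exception):
    """Raised when a gate fails; nothing after it runs."""


def human_gate(prompt: str, ask: Callable[[str], str] = input) -> bool:
    """Blocking approval: only an explicit 'yes' passes."""
    return ask(f"{prompt} [yes/no] ").strip().lower() == "yes"


def run_pipeline(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run gates in order and stop at the first one that does not pass."""
    passed: list[str] = []
    for name, gate in steps:
        if not gate():
            raise GateStopped(f"stopped at '{name}' after {passed}")
        passed.append(name)
    return passed


# Scripted answers standing in for a person at the keyboard.
approve = lambda _prompt: "yes"
reject = lambda _prompt: "no"

steps_ok = [
    ("render all screens", lambda: True),
    ("human sign-off on screens", lambda: human_gate("Screens OK?", approve)),
    ("mechanical watch-gate", lambda: True),
    ("final human sign-off", lambda: human_gate("Ship it?", approve)),
]
print(run_pipeline(steps_ok))

steps_rejected = list(steps_ok)
steps_rejected[1] = (
    "human sign-off on screens",
    lambda: human_gate("Screens OK?", reject),
)
try:
    run_pipeline(steps_rejected)
except GateStopped as exc:
    stopped_message = str(exc)
    print(stopped_message)
```

Raising an exception, rather than returning a status the caller might ignore, is what makes the stop structural.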

Layer 3 — Code

Deterministic QC

A checker that measures defects against the render: clipping, overlap, alignment, blank-canvas, missing audio, black frames. The model never "sees" the bug — the code proves it.
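Three of those checks (clipping, overlap, blank canvas) reduce to rectangle arithmetic once element bounding boxes are available. The sketch below assumes the boxes have already been measured from the render. How qc.py obtains them is not described here, and the `Box` type and function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    name: str
    x: float  # left edge, px
    y: float  # top edge, px
    w: float  # width, px
    h: float  # height, px


def clipped(box: Box, canvas_w: float, canvas_h: float) -> bool:
    """True if any part of the box falls outside the canvas."""
    return (box.x < 0 or box.y < 0
            or box.x + box.w > canvas_w or box.y + box.h > canvas_h)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any interior area."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def layout_defects(boxes: list[Box], canvas_w: float, canvas_h: float) -> list[str]:
    """List every measured defect; an empty list means the layout passes."""
    defects = []
    if not boxes:
        defects.append("blank canvas: no elements rendered")
    defects += [f"clip: {b.name}" for b in boxes if clipped(b, canvas_w, canvas_h)]
    defects += [f"overlap: {a.name}/{b.name}"
                for a, b in combinations(boxes, 2) if overlaps(a, b)]
    return defects


slide = [
    Box("title", 100, 50, 800, 120),
    Box("chart", 100, 200, 1900, 600),  # runs past the right edge
    Box("caption", 100, 150, 400, 40),  # sits on top of the title
]
defects = layout_defects(slide, canvas_w=1920, canvas_h=1080)
print(defects)
```

Each defect is a named measurement, so the report can be checked by anyone without opening the render.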

Regression suite — every past failure, re-checked every build

Each logged defect becomes a locked test case. The checker's verdict on the current build:

$ qc.py deck vbb-deck-v3.html s1..s8
s4  chart spread=210      ok
s5  value cols 2 columns  ok
s6  value cols 1 column   ok
s1  clip/overlap none     ok
s8  clip/overlap none     ok
EXIT=0  FAILS: NONE — all 8 slides pass
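A sketch of how a logged defect might become a locked test case. The failure IDs are taken from the log. The check bodies are guesses at what such a test could assert, and the `build` dictionary is a hypothetical stand-in for whatever the real checker measures. The exit code follows the convention in the output above: 0 only when nothing fails.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    failure_id: str                # ID as logged in failures.md
    check: Callable[[dict], bool]  # True means the defect is absent


# Assumed checks: the log gives the IDs, not the test logic.
CASES = [
    Case("FAIL-VERSION-IN-META-NOT-ON-PAGE",
         lambda build: build["version"] in build["visible_text"]),
    Case("FAIL-RELATIVE-ASSET-LINK-LABS",
         lambda build: all(u.startswith("https://") for u in build["asset_urls"])),
]


def run_suite(build: dict) -> tuple[int, list[str]]:
    """Re-check every past failure; return (exit_code, failing IDs)."""
    failed = [case.failure_id for case in CASES if not case.check(build)]
    return (1 if failed else 0), failed


good_build = {
    "version": "v2",
    "visible_text": "Reliability log v2",
    "asset_urls": ["https://example.com/a.png"],
}
bad_build = {
    "version": "v2",
    "visible_text": "Reliability log",
    "asset_urls": ["img/a.png"],
}

print(run_suite(good_build))
print(run_suite(bad_build))
```

Because cases are only ever appended, a defect that was fixed once cannot return without turning the build red.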
Articulate · reliability log v2 · spun out of the GIG insurance video project · generated from failures.md (50 entries) on 2026-06-19. The page is itself the artifact. · v1 archived