When 16 passing tests still hide a production bug

Sixteen unit tests passed. The TypeScript compiler exited zero. The two adversarial-review rounds we had finished thirty minutes earlier had only flagged minor concerns. We were reading the diff one last time before pushing the commit when a question landed: what does the resulting settings.json actually look like, side-by-side with the CLI documentation?

The two JSON snippets did not match. Not by a comma. Not by a typo. By a structural rule we had got wrong from the beginning of the implementation. The if field — the entire point of the feature — had been emitted at the wrong nesting level. The CLI silently accepts unknown fields on parent objects, so it would have accepted our broken JSON without protest, ignored the if directive entirely, and let every hook fire unfiltered. A feature that was supposed to scope hooks to specific tools would have been a runtime no-op the moment a user enabled it.

And every test had passed, because every test had asked the wrong question.

This post is about a class of bug that becomes structurally more likely the more LLMs you put in the loop: the test that confirms the implementation, written from the implementation, instead of the test that confirms the spec.

The feature that should have been simple

Claude Code CLI v2.1.85 added a small but useful field to its hook configuration. Each hook entry can now carry an if value — a permission-rule string like Bash(git *) or Edit(*.ts) — that scopes the hook to only fire on tool calls matching the rule. Before this, every PreToolUse hook fired on every tool call. With if, you can register a Telegram-routing hook that only intervenes on git writes, while leaving file reads to flow through silently.

We exposed this in CodePulse via a new HOOK_IF_FILTERS env var: a JSON object mapping tool-event names to permission-rule strings. The implementation was three small changes. A parser for the env var. A filters parameter passed through to the hook-installer. A line in the buildHooksSection function that conditionally added an if key to the emitted JSON. Sixteen tests covering the parse, the build, and the integration. Two reviews. Ship it.
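For orientation, here is a minimal sketch of what a parser for that env var could look like. It is not the CodePulse source: the name parseHookIfFilters, the HookIfFilters type, and the choice to throw on malformed input are all assumptions made for illustration.

```typescript
// Hypothetical sketch. The real CodePulse parser is not shown in this post.
type HookIfFilters = Record<string, string>;

function parseHookIfFilters(raw: string | undefined): HookIfFilters {
  // An unset or blank variable means "no filters configured".
  if (raw === undefined || raw.trim() === "") return {};

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("HOOK_IF_FILTERS is not valid JSON");
  }

  // The value must be a plain JSON object, not an array or a primitive.
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("HOOK_IF_FILTERS must be a JSON object");
  }

  const filters: HookIfFilters = {};
  for (const [event, rule] of Object.entries(parsed)) {
    // Each value must be a non-empty permission-rule string.
    if (typeof rule !== "string" || rule.trim() === "") {
      throw new Error(`HOOK_IF_FILTERS.${event} must be a non-empty string`);
    }
    filters[event] = rule;
  }
  return filters;
}
```

Note that a parser like this can be entirely correct and the feature can still be broken: it only validates the input, and says nothing about where the value ends up in the emitted JSON.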

The wire format we got wrong

The CLI's hook configuration has two nesting levels. The outer level is a matcher entry — a JSON object with matcher and hooks keys. The inner level is the array of hook objects — each one has type, command, timeout. Two layers, two scopes.

The CLI v2.1.85 docs place the new if field inside the inner hook object, alongside type / command / timeout. Our implementation placed it at the matcher-entry level, alongside matcher and hooks. Visually, the difference is one indentation level.

The same JSON with two different nesting levels: the wrong one passes type checks but is invisible to the CLI parser

// What we shipped (wrong)
{
  "matcher": "",
  "if": "Bash(git *)",
  "hooks": [{ "type": "command", "command": "...", "timeout": 360 }]
}

// What the CLI evaluates (right)
{
  "matcher": "",
  "hooks": [
    { "type": "command", "command": "...", "timeout": 360, "if": "Bash(git *)" }
  ]
}

If the CLI rejected unknown fields strictly, this would have failed loudly. It does not. JSON parsers tolerate extra properties by default, and the CLI's hook loader simply ignores anything it does not know how to interpret. Our if at the wrong level was syntactically valid JSON, was parsed without complaint, and was thrown away during evaluation. The hook fired on every tool call. The user's filter did nothing.
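To make the silent failure concrete, here is a toy model of a lenient loader. It is not the CLI's code and makes no claim about its internals; loadMatcherEntry and LoadedHook are invented names that only mimic the behaviour described above, where unknown keys on the parent object are never read.

```typescript
// Toy model of a lenient loader. NOT the CLI's actual code.
interface LoadedHook {
  command: string;
  filter: string | undefined; // undefined means "fire on every tool call"
}

function loadMatcherEntry(entry: Record<string, unknown>): LoadedHook[] {
  const rawHooks = entry["hooks"];
  const hooks: unknown[] = Array.isArray(rawHooks) ? rawHooks : [];
  return hooks.map((raw) => {
    const hook = (raw ?? {}) as Record<string, unknown>;
    // Only the inner hook object is consulted. entry["if"] is never read,
    // so a filter placed on the parent disappears without any error.
    const rule = hook["if"];
    return {
      command: String(hook["command"]),
      filter: typeof rule === "string" ? rule : undefined,
    };
  });
}

// The layout we shipped: `if` is a sibling of `matcher` and `hooks`.
const shipped = {
  matcher: "",
  if: "Bash(git *)",
  hooks: [{ type: "command", command: "...", timeout: 360 }],
};

// The documented layout: `if` lives inside the inner hook object.
const documented = {
  matcher: "",
  hooks: [{ type: "command", command: "...", timeout: 360, if: "Bash(git *)" }],
};

const shippedResult = loadMatcherEntry(shipped);       // filter is undefined
const documentedResult = loadMatcherEntry(documented); // filter is "Bash(git *)"
```

Both inputs load without an error. Only the second one carries a filter into evaluation.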

The tests that confirmed the bug

We had eight tests covering buildHooksSection and eight covering the integration with installHooks. They were thorough on the surface. They tested the empty-filter case, the single-filter case, the multi-filter case, the non-tool-event drop, the absent-from-enabled-set drop. They asserted exact field values and compared structures.

Every single one read entry.if and asserted it equaled the filter string.

it('TAB-591: emits `if` field on PreToolUse when filter provided', () => {
  const hooks = buildHooksSection('cmd', DEFAULT_SET, { PreToolUse: 'Bash(git *)' });
  expect(hooks['PreToolUse'][0].if).toBe('Bash(git *)');
});

The implementation emitted if at the matcher-entry level. The test read if at the matcher-entry level. Both consistent with each other. Both inconsistent with the spec. The test confirmed the implementation, not the contract.

The deeper failure was that one test had a comment block declaring the layout as an invariant of the design. We had written, in plain English, "The if field sits at the matcher-entry level (sibling of matcher and hooks), not inside each hook object — matches the CLI v2.1.85 wire format." That comment was wrong. The next maintainer who reads it will internalise the wrong invariant before they read the spec.

How the bug surfaced

The catch was not from any test we had written. It was from an adversarial review that started by ignoring the diff entirely and reading the Claude Code hooks reference cold. Within the first paragraph the reviewer noticed the example layout did not match what our tests asserted.

Reviewer: I've been reading code.claude.com/docs/en/hooks. Their examples
put 'if' inside the inner hook object, not at the matcher-entry level. Your
HookEntry interface has it at the wrong level. All 16 of your tests pass
with the wrong layout because the tests are reading from the same wrong
position the implementation writes to.

That single paragraph turned what looked like a clean ship into a CRITICAL finding. We rewrote the implementation in fifteen minutes — moving the field one nesting level deeper. We rewrote the tests in another twenty — converting every assertion to read entry.hooks[0].if and adding a negative assertion that the entry itself does NOT contain its own if. The negative assertion is the key. It is the line that makes the regression to the broken layout impossible without flipping a red test.
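The relocation described above can be sketched roughly as follows. The function and interface names mirror the ones mentioned in this post, but the bodies are assumptions: the TOOL_EVENTS set, the empty matcher, and the fixed timeout are illustrative, not a copy of the CodePulse source.

```typescript
// Sketch only: names follow the post, bodies are assumptions.
interface HookCommand {
  type: "command";
  command: string;
  timeout: number;
  if?: string; // inner hook object: where the docs place the field
}

interface HookEntry {
  matcher: string;
  hooks: HookCommand[]; // deliberately no `if` at this level
}

// Assumed list of tool events. Only these may carry a filter.
const TOOL_EVENTS: ReadonlySet<string> = new Set(["PreToolUse", "PostToolUse"]);

function buildHooksSection(
  command: string,
  enabled: ReadonlySet<string>,
  filters: Record<string, string> = {},
): Record<string, HookEntry[]> {
  const section: Record<string, HookEntry[]> = {};
  // Iterating the enabled set drops filters for events that are not enabled.
  for (const event of enabled) {
    const hook: HookCommand = { type: "command", command, timeout: 360 };
    const rule = filters[event];
    // Filters on non-tool events are dropped rather than emitted.
    if (rule !== undefined && TOOL_EVENTS.has(event)) {
      hook.if = rule;
    }
    section[event] = [{ matcher: "", hooks: [hook] }];
  }
  return section;
}
```

Leaving if off the HookEntry interface matters as much as adding it to HookCommand: with the type in this shape, writing the field at the matcher-entry level becomes a compile error instead of a silent no-op.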

We also rewrote the comment block. The new comment says the right thing about the layout, includes the docs URL inline, and starts with a one-sentence paragraph noting that an earlier version of the file had locked in the wrong layout in plain English alongside the assertions.

Why this class of bug is getting more common

The pattern is simple. Tests that are written from the implementation will pass on whatever the implementation does, even when it does the wrong thing. Tests that are written from the spec — from the docs, from the wire-format example, from the externally observable contract — fail until the implementation is right.
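One way to write a test from the spec is to transcribe the documented example into a fixture and compare the whole emitted structure against it. The sketch below assumes the documented layout quoted earlier in this post; DOCUMENTED_ENTRY, canonical, and matchesSpec are hypothetical names, and the hand-rolled comparison only exists to keep the example free of test-runner imports.

```typescript
// Fixture transcribed from the documented layout, not from the implementation.
const DOCUMENTED_ENTRY = {
  matcher: "",
  hooks: [{ type: "command", command: "cmd", timeout: 360, if: "Bash(git *)" }],
};

// Serialises a JSON value with sorted keys so that key order does not matter.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonical).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonical(obj[key])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value);
}

// True only when the emitted entry has exactly the documented shape.
function matchesSpec(actual: unknown): boolean {
  return canonical(actual) === canonical(DOCUMENTED_ENTRY);
}

// The layout our first implementation emitted.
const wrongLayout = {
  matcher: "",
  if: "Bash(git *)",
  hooks: [{ type: "command", command: "cmd", timeout: 360 }],
};

// The layout after the fix.
const rightLayout = {
  hooks: [{ if: "Bash(git *)", timeout: 360, command: "cmd", type: "command" }],
  matcher: "",
};

const wrongMatches = matchesSpec(wrongLayout); // false
const rightMatches = matchesSpec(rightLayout); // true
```

In a runner such as Jest or Vitest, expect(actual).toEqual(DOCUMENTED_ENTRY) does the same job. The point is where the expected value comes from: a whole-structure comparison against a fixture copied from the docs fails on the wrong nesting level, while a single-field assertion copied from the implementation does not.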

Manually authored tests usually start from the spec because the human writing them is reading the docs. Tests generated by an AI assistant in the same session as the implementation typically start from the implementation. The model has the implementation in its context window and writes tests that exercise it. Both implementation and tests now derive from the same in-context model state, which may or may not match the external spec. If it does, you ship a working feature. If it does not, you ship a broken feature with a green test suite and an honest belief that the tests caught everything.

This is not a problem with AI-assisted testing per se. It is a structural issue with co-generation: when implementation and verification share an information source, the verification cannot independently catch errors in the source. The same issue exists for human pairs — the developer who writes the implementation and the tests in the same sitting often catches less than two developers working independently. AI-assisted workflows just amplify the problem because the in-context state is so coherent.

The countermeasure is the same as it has always been: separate the source of truth for the verification from the source of truth for the implementation. You can do this in several ways.

The patterns that catch it

A few habits make wire-format bugs surface earlier, regardless of who or what is writing the tests.

Negative assertions on the wrong layout. For every contract you can describe positively — "the field belongs in location X" — write the corresponding negative assertion that fails if the field appears in any other location. In our case, expect(entry).not.toHaveProperty('if') is the line that makes the wrong-level layout impossible to ship. The positive assertion is necessary but not sufficient on its own. The negative assertion is what locks the contract.

it('TAB-591: `if` MUST NOT appear at matcher-entry level (CLI ignores it there)', () => {
  const hooks = buildHooksSection('cmd', DEFAULT_SET, { PreToolUse: 'Bash(git *)' });
  expect(hooks['PreToolUse'][0]).not.toHaveProperty('if');
  expect(hooks['PreToolUse'][0].hooks[0].if).toBe('Bash(git *)');
});

Cross-reference the spec inline in the test file. Add the docs URL as a comment at the top of any test that locks in a wire format. The next person who reads the test should be able to verify the assertions against the canonical spec without leaving the file. We now keep a "wire-format reminder for readers" comment block at the top of our hook-installer test suite that names the docs URL and warns that an earlier version of the same comment block had got the layout exactly backwards.

Adversarial review that ignores the diff. Our review process now has two passes. The first pass looks at the diff, scoped narrowly to what changed. The second pass — the one that caught this — starts by re-reading the spec, the docs, or the wire format from outside the diff entirely, and only then compares against what shipped. The second pass costs more time. It catches more.

[Figure: A two-pass adversarial review where Turn 1 reads the diff and Turn 2 starts from the spec, with the second pass catching what the first missed]

Smoke tests against the real consumer. Unit tests against your code are necessary but not sufficient when you are writing JSON for a parser you do not control. Even one end-to-end smoke test that loads the emitted JSON in the actual CLI and observes the runtime behavior would have caught this bug instantly. We are adding one.

A clean tsc run and a green test suite are a starting line, not a finish line. We have been writing this in our internal docs for months and clearly not internalising it. Both checks confirm that the code is internally consistent. Neither confirms that the code matches the contract of the system it talks to. The check for that is a side-by-side comparison of the emitted output with the docs, done by someone who is not the author.

What this would have cost in production

If we had shipped without the second adversarial pass, the failure mode would have been the most frustrating possible variety. The user sets HOOK_IF_FILTERS={"PreToolUse":"Bash(git *)"}. They restart CodePulse. The startup log says TAB-591: hook `if` filters active. They open .claude/settings.json and verify the if key is there. Every positive observability signal confirms the feature is working.

Then they run a non-git Bash command. The hook fires anyway. They run a git command. The hook fires. They run a Read. The hook fires. The filter does nothing. They check the docs. They check the JSON. The JSON has the if field. Everything looks right except the runtime behavior.

The user files a support ticket. We look at their settings.json. The if field is there. We are now confused too. The thing we shipped has the right shape but the wrong semantics, and the only way to discover that is to know about the nesting level we got wrong in the first place. The bug would have been hard to reproduce, hard to diagnose, and hard to admit. It would have eroded trust in the feature long before it eroded our reputation for reliability.

That is the production-cost calculation. Not "would the bug have shipped?" but "how long would the bug have lived undetected, and how would users have framed it when they finally noticed?" In our case the answer was probably weeks, and "the filter doesn't work, this product is half-baked." Cheap to fix in code. Expensive to fix in user perception.

The shape of the fix going forward

We added two structural protections so this class of bug is harder to repeat.

The first is a checklist item in our adversarial-review prompt: for every wire-format change, the reviewer must confirm the layout against an externally published spec, not against the diff. This is a process change, but it is documented in the review template, so the next reviewer cannot skip it without consciously deciding to.

The second is a code-review-time test policy: every test that asserts the presence of a field at a specific JSON location MUST have a paired negative assertion for every other plausibly wrong location. This is a structural change. It makes the cost of writing the wrong test slightly higher, in exchange for a much higher chance of catching the wrong implementation.
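One way to make that policy cheap to follow is a small helper that lists every location where a key occurs, so a single assertion covers the expected location and rules out all the others. The helper below is a hypothetical sketch, not part of CodePulse; findKeyPaths and its path notation are invented for illustration.

```typescript
// Hypothetical helper: returns every path at which `key` occurs in a JSON value.
function findKeyPaths(value: unknown, key: string, path: string = "$"): string[] {
  const found: string[] = [];
  if (Array.isArray(value)) {
    // Recurse into each array element, recording its index in the path.
    value.forEach((item, index) => {
      found.push(...findKeyPaths(item, key, `${path}[${index}]`));
    });
    return found;
  }
  if (value === null || typeof value !== "object") {
    return found; // primitives cannot contain keys
  }
  for (const [name, child] of Object.entries(value)) {
    if (name === key) found.push(`${path}.${name}`);
    found.push(...findKeyPaths(child, key, `${path}.${name}`));
  }
  return found;
}

const fixedEntry = {
  matcher: "",
  hooks: [{ type: "command", command: "cmd", timeout: 360, if: "Bash(git *)" }],
};
const brokenEntry = {
  matcher: "",
  if: "Bash(git *)",
  hooks: [{ type: "command", command: "cmd", timeout: 360 }],
};

const fixedPaths = findKeyPaths(fixedEntry, "if");   // ["$.hooks[0].if"]
const brokenPaths = findKeyPaths(brokenEntry, "if"); // ["$.if"]
```

Asserting that the list equals exactly ["$.hooks[0].if"] is a positive and a negative assertion at once: the field must be present where the spec puts it and absent everywhere else, including locations nobody thought to list.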

Neither of these changes will catch every wire-format bug. Wire formats are a moving target, and the CLI we depend on can — and will — change shape under us. But both of them shift the failure mode from "bug ships, user discovers it" to "bug is caught at code review time, before the commit lands." That is the trade we want.

The fix to the original bug, by the way, was small. Move one field from one nesting level to another. Update the type definition. Update sixteen test assertions. Add one negative assertion. Rewrite a comment block. About forty-five minutes of work. The cost of the catch was the same forty-five minutes plus the two adversarial-review rounds: call it ninety minutes in total. The cost of not catching it would have been measured in user trust and support load, neither of which trades well for ninety minutes.

If your CI uses a similar hook system — or any feature that emits JSON for a downstream parser you do not control — the question is worth asking from your own desk: do my tests confirm the implementation, or do they confirm the spec? The answer is rarely binary. Most teams have some of each. The honest exercise is finding the tests in the implementation-confirming category and either rewriting them against the spec, or pairing them with negative assertions that lock the contract.

It is a small habit. It is the difference between sixteen passing tests that are evidence and sixteen passing tests that are noise.
