We shipped CodePulse v2.3.120 on a Friday afternoon. The release contained the most thorough release-pipeline hardening we had ever landed: worker version injection, post-deploy /health verification, step-level timeouts, a 60-minute job ceiling, and a retry wrapper around every transient external call. Two adversarial-review rounds. Honest paranoia. Ship-ready.
A minute after the GitHub Release page lit up green, we ran the verification curl that was supposed to confirm the live worker bundle was the just-deployed one. The response came back {"status":"ok","timestamp":"…","version":"0.0.0-dev"}. The placeholder. The thing every TAB-595 change was supposed to make impossible.
We had not noticed something subtle: every workflow improvement we had merged for the past six weeks lived in our dev branch's release.yml, but no production release run had ever actually executed any of them. The release pipeline shipped binaries successfully, but the new logic — the parse checks, the NSIS resilience, the install-step retries, the worker version verify — had been silently bypassed every single time. This is the post-mortem.
The setup that lulled us
Our release pipeline is intentionally automated end-to-end. Conventional commits land on dev. A versioning.yml workflow runs on every push to dev, invokes semantic-release, decides the patch bump from commit types (fix: and feat: produce a patch — by versioning policy every release is a patch unless a major bump is manually approved), and creates the tag. A second workflow (release.yml) then takes the tag, checks out the code, builds the installer, signs it, uploads it to Cloudflare R2, deploys the worker, and creates a GitHub Release.
The handoff between the two workflows is the single line that broke us.
# .github/workflows/versioning.yml
- name: Trigger release workflow
  if: steps.release.outputs.released == 'true'
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
  run: |
    VERSION="${{ steps.release.outputs.version }}"
    gh workflow run release.yml -f tag="v${VERSION}"
gh workflow run is a workflow_dispatch trigger. The reasoning at the time was pragmatic. We had previously tried on: push: tags, but tags pushed with the default GITHUB_TOKEN (the credential actions/checkout configures) do not fire downstream workflows; that is GitHub's anti-recursion protection. To get around it, we would have needed a personal access token with workflow scope, which we did not have at the time. workflow_dispatch worked, did not need extra scope, and the comment we left in the file said so:
# semantic-release creates the tag, but GitHub doesn't reliably fire
# 'push tags' events for tags pushed with --follow-tags or via API.
# Direct workflow_dispatch trigger is the only guaranteed path.
The comment was correct when it was written. By the time we shipped TAB-595 it was wrong, and we did not notice. A personal access token with workflow scope had been provisioned weeks earlier, during a credential rotation tracked in a separate ticket. The premise that justified workflow_dispatch no longer held. But the line of code remained, and it was carrying a tax we had not measured.
The rule that bit us
GitHub Actions has a documented rule that we had read, internalised, and then forgotten the moment we needed it.
When you run a workflow using workflow_dispatch without an explicit ref, the workflow file used is the one on the default branch at the time of the call.
The default branch of our repository is main. The tag we passed to the dispatch was on dev. The workflow run that resulted checked out the code at the tag — that is, dev's source — but executed release.yml from main. Two different files, two different commits, two different sets of behaviour. The build was always running against current source code, but the build-driving workflow was always running from main's frozen release.yml.
We knew main was behind dev. Branch protection rules made it a deliberate gate. We just had no idea it was on the release path.
The frozen release.yml on main was last touched by a commit titled "chore: sync workflow files to main for workflow_dispatch support". That was the very commit that established this pattern, two months earlier, when main and dev had been close enough to parity that the divergence was invisible. Since that sync, dev's release.yml had received roughly thirty commits' worth of improvements: a parse-check stage that would have caught two installer parse bombs in 100 milliseconds, an NSIS resilience pattern that handled a Chocolatey 503 outage, retry wrappers around every external install call, worker version injection, and post-deploy verification. Main's release.yml had received zero. Every release had been running an early-March copy of the file.
How we found it
The first sign was concrete and ten seconds long. Production https://api.codepulse.at/health returned the placeholder version string. The TAB-595 inject step was supposed to patch that placeholder before deploy. If the inject step had run, the response would have been the release tag. If the inject step had failed, the workflow would have thrown — release.yml had explicit throw "wrangler.toml is missing the WORKER_VERSION placeholder" guards on every assertion. So the inject step had neither run nor failed. It had simply not existed in the workflow that ran.
A gh api call against the run resolved the rest in one round trip:
$ gh api repos/4brainiacs/codepulse/actions/runs/24881389532 \
    --jq '{path, head_branch, head_sha}'
{
  "path": ".github/workflows/release.yml",
  "head_branch": "main",
  "head_sha": "84af75c7fea96ab0ef2e4e06a3453e14cfcf4f04"
}
head_branch: main. head_sha: 84af75c…. The workflow file run for tag v2.3.120 was main's release.yml, frozen at that early-March commit, with none of our improvements. The runtime build was current; the workflow logic that drove it was two months stale.
The shadow this cast was the part that hurt. TAB-588, a parse-check stage that would have caught two consecutive installer parse bombs in milliseconds, had merged to dev and been tagged but had never run. TAB-592, NSIS install resilience to absorb the Chocolatey 503 that blocked v2.3.116, never ran. TAB-594, install-step retries for bun install, npm ci, and wrangler deploy, never ran. Because the npm and Chocolatey registries happened to stay available during each deploy window, the fact that these improvements were silently inert had not produced a single failed release. The pipeline looked healthy because it had been quietly running on autopilot from a 60-day-old image.
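The retry wrappers from TAB-594 are not shown in this post. As a rough illustration of the idea only, a minimal shell sketch of such a wrapper might look like the following. The function name `retry`, its argument order, and the doubling backoff are all assumptions made for this example, not the contents of the real release.yml.

```shell
# Hypothetical retry helper (illustrative, not the wrapper from release.yml).
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${attempt}/${max_attempts} failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))   # double the wait before the next attempt
  done
}
```

An install step would then call something like `retry 3 5 npm ci`. The point of the incident stands either way: a wrapper like this only helps if the workflow file that contains it is the one the runner actually reads.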
Why workflow_dispatch makes this a class of bug
It is tempting to read this as a one-off failure of attention. It is not. The structure of workflow_dispatch makes drift between the default branch and the tagged branch a permanent risk. The actions/checkout@v6 step in your workflow file controls which code gets built. The dispatch trigger controls which workflow file drives the build. These two things resolve to different commits any time main and dev diverge — which is most of the time, in any team that uses a feature-branch flow.
The drift is invisible from the workflow logs. The build commands print correctly. The checkout step shows the tag's SHA. Every step succeeds. The only signal that something is wrong is the absence of steps that should have been there, which is exactly the kind of signal humans are bad at noticing: a missing step raises no alert, it just leaves the log quieter.
Anything that happens only in the new workflow file — a new step, a new env var, a new guard — is invisible while it remains undeployed to main. You will not see "step skipped" or "step missing" in the logs. The step is not part of the file the runner read. It does not exist as far as the run is concerned.
The drift-proof fix
The right answer is not "remember to sync main more often." That is a process fix for a structural problem. The structural answer is to take main off the release path. We migrated to the rule GitHub Actions documents but rarely talks about explicitly: with on: push: tags, the workflow file used is the one at the tag.
# .github/workflows/release.yml (runs from the tagged branch)
on:
  push:
    tags: ['v*']
  workflow_dispatch:  # kept as an emergency manual escape hatch
    inputs:
      tag:
        description: 'Tag to release (e.g. v2.1.33)'
        required: true
For an automatic semantic-release flow, on: push: tags resolves the workflow at the tag every time. The tag is created from dev. The workflow file at that commit is dev's latest. There is no main resolution involved. There is no drift surface. By construction.
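With both triggers declared, the steps that follow need to know which tag they are releasing no matter which trigger fired. A minimal sketch of that resolution, assuming the workflow passes values in as plain arguments: `GITHUB_EVENT_NAME` and `GITHUB_REF_NAME` are default GitHub Actions variables, while the helper name `resolve_tag` and the `INPUT_TAG` variable are hypothetical names chosen for this example.

```shell
# Hypothetical helper: pick the release tag regardless of which trigger fired.
# Arguments: <event_name> <ref_name> <manually supplied tag, may be empty>
resolve_tag() {
  local event_name="$1"
  local ref_name="$2"
  local input_tag="$3"
  local tag
  case "$event_name" in
    push)              tag="$ref_name" ;;    # tag push: the ref name is the tag
    workflow_dispatch) tag="$input_tag" ;;   # manual run: use the typed input
    *) echo "resolve_tag: unsupported event '$event_name'" >&2; return 1 ;;
  esac
  # Refuse anything that does not look like a v-prefixed release tag.
  case "$tag" in
    v[0-9]*) printf '%s\n' "$tag" ;;
    *) echo "resolve_tag: '$tag' does not look like a release tag" >&2; return 1 ;;
  esac
}
```

A step could call it as `TAG="$(resolve_tag "$GITHUB_EVENT_NAME" "$GITHUB_REF_NAME" "$INPUT_TAG")"`, failing early if a manual dispatch arrives without a usable tag.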
Two preconditions made this safe to flip. The first was a personal access token with workflow scope, which we already had after a prior rotation — that is the credential that lets a CI-pushed tag trigger downstream workflows. The second was a one-time sync of main with dev for .github/workflows/ only. We did not need to merge code from dev to main; we just needed main's workflow files to be current so the manual workflow_dispatch escape hatch — the one we kept for emergency reruns — does not fall back to a stale view of the pipeline.
After flipping, our versioning.yml lost the trigger step entirely. The semantic-release tag push fires release.yml automatically. Drift between main and dev no longer matters for normal releases. A non-blocking drift-check step in the versioning workflow surfaces a warning if the two branches' workflow files diverge — useful for catching anyone who adds a new step on dev and forgets the manual-dispatch escape hatch.
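The drift-check step itself is not reproduced in this post. One way such a non-blocking check could be written is sketched below, under the assumption that both branches have been fetched locally. The function name `workflow_drift_warning` is hypothetical; `::warning::` is the standard GitHub Actions workflow command for emitting a warning annotation.

```shell
# Hypothetical non-blocking drift check (illustrative sketch).
# Arguments: <base_ref> <head_ref>, for example origin/main and origin/dev.
workflow_drift_warning() {
  local base_ref="$1"
  local head_ref="$2"
  local changed
  # List workflow files that differ between the two refs.
  changed="$(git diff --name-only "$base_ref" "$head_ref" -- .github/workflows/)" || {
    echo "::warning::drift check could not compare $base_ref and $head_ref"
    return 0
  }
  if [ -n "$changed" ]; then
    # Emit a warning annotation naming the files that have diverged.
    echo "::warning::workflow files differ between $base_ref and $head_ref: $(echo "$changed" | tr '\n' ' ')"
  fi
  return 0   # never fail the job: this is a signal, not a gate
}
```

Returning 0 on every path is deliberate in this sketch: the check exists to make divergence visible, not to block a release.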
Why we did not reach for "merge dev to main"
The obvious quick fix to our incident was a one-time merge of dev into main, then carry on with the workflow_dispatch pattern. We considered it. We rejected it for a structural reason that is worth naming.
A one-time merge does not change the failure mode. It resets the clock on a process that will drift again. The next time someone adds a CI step on dev — a parse check, a retry wrapper, a security scan — main will fall behind, and the next release will silently run without the new step. There is no signal that catches this. No CI failure. No alert. Just a workflow run that completes successfully without doing the thing the author intended.
The same thinking applies to our local-first architecture. When the structure of the system makes a class of failure recurrent, you fix the structure. You do not retrain people to remember the sync. The structure is what carries the load when you are tired, distracted, or new to the codebase.
What it cost us
We shipped four meaningful CI improvements that did nothing in production. Three of them (TAB-588, TAB-592, TAB-594) were defensive. They protected against transient failures we had already seen. The fact that none fired during the 60-day window they were nominally active means we did not learn whether they worked. We had two months of accumulated belief that our pipeline was getting more resilient, none of which was grounded in evidence.
The fourth improvement, TAB-595's worker version verify, was the gate that surfaced the original problem. If we had been running dev's release.yml for v2.3.120, the post-deploy /health check would have failed the workflow loudly, with a clean error message naming the placeholder string and the expected version. Instead the workflow completed successfully, the binary uploaded, the GitHub Release published, and the only sign anything was wrong was a placeholder version string on a worker endpoint nobody routinely watches.
That is the part of the story that should sting any team running a similar pattern. The verification gate that would have caught the silent failure was itself silently bypassed. It is exactly the failure mode that motivated the gate in the first place — and the surface that delivered it ate it whole.
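For readers who want a concrete picture of the kind of gate described here, a minimal sketch follows. It assumes the /health response is the flat JSON object quoted at the top of this post and that the expected version is the release tag with an optional leading "v" stripped. Both function names are hypothetical, and this is not the actual TAB-595 step.

```shell
# Hypothetical post-deploy version check (illustrative sketch).
extract_version() {
  # Pull the "version" field out of a small flat JSON object without jq.
  printf '%s' "$1" | sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Arguments: <expected tag or version> <health response body>
verify_deployed_version() {
  local expected="${1#v}"   # assumption: tags carry a leading "v"
  local body="$2"
  local actual
  actual="$(extract_version "$body")"
  actual="${actual#v}"
  if [ -z "$actual" ]; then
    echo "health response has no version field" >&2
    return 1
  fi
  if [ "$actual" != "$expected" ]; then
    echo "deployed version '$actual' does not match expected '$expected'" >&2
    return 1
  fi
  echo "deployed version $actual verified"
}
```

In a workflow step this might be invoked as `verify_deployed_version "$TAG" "$(curl -fsS https://api.codepulse.at/health)"`, where `TAG` is assumed to hold the release tag.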
Patterns we are carrying forward
A handful of habits crystallised from this. None of them are subtle.
- workflow_dispatch for human-driven dispatch only. It is great for ad-hoc reruns, debugging, and one-off operations. It is structurally wrong as the primary trigger for an automated pipeline that is supposed to track the tagged commit.
- on: push: tags for tag-driven release flows, with a workflow-scoped PAT to bypass GitHub's anti-recursion when CI creates the tag.
- Drift detection is a non-blocking signal, not a hard gate. The cost of a false positive (a nudge to sync main from dev) is negligible. The cost of a false negative (a silent stale workflow) is what this post-mortem is.
- Identity-style verification gates run from the tag. A gate that proves "this deploy actually shipped X" must execute from a workflow file that reflects the same commit as X. Otherwise the gate is structurally identical to the surface it is supposed to verify.
The same structural-correctness instinct applies broadly across our approval pipeline and our hook system. When a class of failure is invisible by construction, fixing the structure beats fixing the process every time.
What to do this week if you run a similar pipeline
If your release flow uses workflow_dispatch to bridge two workflows, you have the same surface. The diagnostic takes one minute.
- Run gh api repos/<owner>/<repo>/actions/runs/<recent_run_id> --jq '{path, head_sha, head_branch}' against your most recent release run.
- Check whether head_sha resolves to a commit you recognize as current.
- Diff .github/workflows/<workflow-file>.yml between your default branch and your release branch.
If the SHA points to your default branch and your workflow file has changed on the release branch, you have the trap. Migration is small: a one-time sync of workflow files, a swap of the trigger from workflow_dispatch to on: push: tags, and a workflow-scoped PAT for the tag push if you do not have one already.
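The first two checks can be scripted. The sketch below assumes an authenticated gh CLI and a local clone that has the tag. The function names are hypothetical, and the repository, run id, and tag are values you supply.

```shell
# Hypothetical diagnostic helpers (illustrative sketch).
run_used_tag_commit() {
  # Compare the commit a run reports as head_sha with the tag's commit.
  local run_head_sha="$1"
  local tag_commit_sha="$2"
  if [ "$run_head_sha" = "$tag_commit_sha" ]; then
    echo "ok: workflow file came from the tagged commit"
  else
    echo "stale: run used $run_head_sha, tag points at $tag_commit_sha"
    return 1
  fi
}

# Arguments: <owner/repo> <run_id> <tag>
diagnose_run() {
  local repo="$1"
  local run_id="$2"
  local tag="$3"
  local run_sha tag_sha
  # head_sha of the run, as reported by the Actions REST API.
  run_sha="$(gh api "repos/${repo}/actions/runs/${run_id}" --jq '.head_sha')" || return 2
  # Commit the tag points at (resolves annotated tags to their commit).
  tag_sha="$(git rev-list -n 1 "$tag")" || return 2
  run_used_tag_commit "$run_sha" "$tag_sha"
}
```

A "stale" result for a recent release run is the same symptom described in this post: the source was built at the tag, but the workflow file was resolved somewhere else.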
The cheapest moment to fix this is now, before your next post-mortem.