Refactoring a 4,506-line MessageRouter without breaking behaviour

We measured the cost of adding a feature to CodePulse honestly for the first time after a Telegram bug (TAB-562) turned a one-line orphan-callback fix into a 5-file change with seven follow-up commits. The numbers were painful. Median feature time was four to eight hours. Median files touched was five to seven. Median bugs surfaced in the next release was ten to twenty-five. The cost was not in any single component. It was in the seams between them, and almost all the seams went through one file.

That file was src/agent/message-router.ts. It was 4,506 lines long. It owned 19 ad-hoc Maps with hand-rolled TTL logic. Its dispatch was a 191-line switch statement. Its constructor took 17 positional parameters. It was not a god object on purpose; it had grown into one over twenty months as every new feature added "one more thing the router needs to know about."

This is the post-mortem on the decomposition that shipped under TAB-566 — a multi-phase Feature Framework that brought the router to ~800 lines, replaced the ad-hoc state with reusable primitives, and made the next ten features measurably cheaper to build.

Why the diagnosis took longer than the fix

The temptation with a 4,500-line file is to grep for the longest function and start splitting. We tried that on a Friday. By Monday the working tree had three half-finished module extractions, four broken tests, and a series of git commits with messages like "wip: extract approval bits" that none of us could explain a week later. We reverted everything. The lesson was that splitting a monolith you don't understand produces a different monolith you also don't understand, now distributed across five files instead of one.

The Friday-Monday whiplash made us slow down. We spent six hours just reading the router top-to-bottom and writing a one-paragraph note for every distinct concern we found. The notes were boring on purpose. They said things like "lines 940–1117 dispatch callback messages by action.type string match" and "lines 2200–2380 manage approval-card state across the held-HTTP pipeline and use the pendingRequests Map." Twenty-three notes, no architecture, no diagrams. Just a flat enumeration of what the file actually did.

The notes clustered into five concerns, none of which had been planned that way:

Approval pipeline state — pendingRequests, approval batching, deferred decisions, learning patterns, the seven Maps that tracked all of it.
Plan generation — slash-command intake, extractive vs CLI-native planning, plan execution coordination, four Maps for plan state.
Proactive intelligence — morning briefings, error watcher, replay, cost intel, eight callbacks for opt-in features.
Session lifecycle — bridge sessions, smart continuation windows, conversation history per Telegram chat, four Maps for session metadata.
Stop and MCP coordination — the Stop hook bridge, MCP heartbeat, watchdog state, two Maps for stop coordination.

That clustering was the actual decomposition. Five concerns, five future modules. Once we had the clusters, the question was no longer "how do we split this file" but "what would each cluster look like as a standalone module that uses the router only as a thin dispatcher?" The conceptual work took six hours and the resulting refactor took twelve days, in that order. The conceptual work was not the slow part. It was the part that made the rest fast.

Building the primitives before the modules

The primitives came before the modules — Phase 0 of the framework, three small files totaling about 250 lines. We wrote them first because every Map-shaped piece of router state would need to migrate, and migrating to a primitive that did not exist yet was strictly impossible.

ManagedMap<V> is a generic class with built-in TTL, automatic cleanup, size limits, touch() to reset TTL on access, and a drain() method for clean shutdown. It replaced 19 hand-written Maps that all expressed slightly different versions of "remember this thing for N minutes, then forget it." Every one of them had its own cleanup interval, its own size cap (or none), its own expiration semantics. Six of the 19 had memory leaks in production; one had a known unbounded-growth bug that had been pinned with a // FIXME comment for four months.

export class ManagedMap<V> {
  constructor(opts: { ttlMs: number; maxSize: number; name: string }) { ... }
  set(key: string, value: V): void { ... }
  get(key: string): V | undefined { ... }
  touch(key: string): void { ... }              // reset TTL on access
  drain(): IterableIterator<[string, V]> { ... } // for shutdown
}
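To make the outline above concrete, here is a minimal sketch of how a map with that surface could behave. It is not the CodePulse implementation. The injectable `now` clock, the sweep-on-write cleanup (instead of a timer), and the evict-oldest policy at `maxSize` are all assumptions chosen to keep the example small and testable.

```typescript
// Sketch only: expiry, cleanup, and eviction policies here are assumptions.
type Entry<V> = { value: V; expiresAt: number };
type ManagedMapOptions = { ttlMs: number; maxSize: number; name: string };

class ManagedMap<V> {
  private readonly entries = new Map<string, Entry<V>>();
  private readonly opts: ManagedMapOptions;
  private readonly now: () => number;

  // `now` is a hypothetical injectable clock so expiry can be tested without timers.
  constructor(opts: ManagedMapOptions, now: () => number = Date.now) {
    this.opts = opts;
    this.now = now;
  }

  set(key: string, value: V): void {
    this.sweep();
    // Deleting first moves a re-inserted key to the end of insertion order.
    this.entries.delete(key);
    // At capacity, evict the oldest insertion (assumed policy).
    while (this.entries.size >= this.opts.maxSize) {
      const oldest = this.entries.keys().next();
      if (oldest.done) break;
      this.entries.delete(oldest.value);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.opts.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Reset the TTL of a live entry; expired or missing keys are left alone.
  touch(key: string): void {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt > this.now()) {
      entry.expiresAt = this.now() + this.opts.ttlMs;
    }
  }

  // Hand back every live entry and empty the map, for clean shutdown.
  drain(): IterableIterator<[string, V]> {
    this.sweep();
    const live: Array<[string, V]> = [];
    for (const [key, entry] of this.entries) live.push([key, entry.value]);
    this.entries.clear();
    return live[Symbol.iterator]();
  }

  // Drop everything whose TTL has elapsed.
  private sweep(): void {
    const t = this.now();
    for (const [key, entry] of this.entries) {
      if (entry.expiresAt <= t) this.entries.delete(key);
    }
  }
}
```

The point of the sketch is that TTL, size cap, and shutdown all live in one place, so a migrated Map cannot forget any of them.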

CallbackRegistry replaces the 191-line switch statement at message-router.ts:940-1117. The old code had 32 cases, each one a string match on action.type, each one calling a method on the router by name. The new code is a typed handler map. Modules register their handlers at startup; the registry dispatches by lookup, not by switch. Compile-time safety on action types eliminates a class of bug where a typo in a callback name produced a silent no-op.
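A rough sketch of a typed handler map follows, under stated assumptions. The `ActionCatalog` interface and its action names are hypothetical, and rejecting duplicate registrations is one possible design (it is the kind of check that would surface a handler silently overriding another), not a description of the real registry.

```typescript
// Hypothetical action catalogue: the action names each namespace accepts.
interface ActionCatalog {
  approval: 'approve' | 'deny';
  plan: 'resume';
}
type Namespace = keyof ActionCatalog;
type Handler = (payload: unknown) => void;

class CallbackRegistry {
  private readonly handlers = new Map<string, Handler>();

  // The action parameter is tied to the namespace, so a misspelled
  // action such as 'aprove' is a compile error rather than a silent no-op.
  register<N extends Namespace>(ns: N, action: ActionCatalog[N], handler: Handler): void {
    const key = `${ns}:${action}`;
    if (this.handlers.has(key)) {
      throw new Error(`duplicate handler for ${key}`);
    }
    this.handlers.set(key, handler);
  }

  // Incoming callback data is untyped at runtime, so dispatch takes plain
  // strings and reports whether any handler was found.
  dispatch(ns: string, action: string, payload: unknown): boolean {
    const handler = this.handlers.get(`${ns}:${action}`);
    if (!handler) return false;
    handler(payload);
    return true;
  }
}
```

Dispatch is a single Map lookup, so adding a handler never touches a shared switch statement.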

withSessionIsolation<T>(fn) is a single method on the persistent CLI bridge. It saves claudeSessionId, totalTurns, and projectDir, runs the wrapped function, and restores all three on exit, even if the function throws. It replaces three copy-pasted save/restore blocks in the bridge that had drifted apart over time. One of the three had been buggy for months: executeApprovedPlan did not restore projectDir on the error path, which meant a failed plan execution silently left the bridge pointed at the wrong project's working directory. The bug surfaced as "Claude is editing the wrong files" in production. After Phase 0, that class of bug was structurally impossible to reintroduce.
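The save/run/restore shape described here maps naturally onto try/finally. The sketch below assumes an async wrapped function and a hypothetical `BridgeSketch` class holding only the three fields named above; the real bridge presumably carries much more state.

```typescript
// Hypothetical stand-in for the persistent CLI bridge.
class BridgeSketch {
  claudeSessionId: string | null = null;
  totalTurns = 0;
  projectDir = '';

  async withSessionIsolation<T>(fn: () => Promise<T>): Promise<T> {
    // Snapshot the three fields before the wrapped work can change them.
    const saved = {
      claudeSessionId: this.claudeSessionId,
      totalTurns: this.totalTurns,
      projectDir: this.projectDir,
    };
    try {
      return await fn();
    } finally {
      // Runs on success and on throw, so the error path cannot skip a field.
      this.claudeSessionId = saved.claudeSessionId;
      this.totalTurns = saved.totalTurns;
      this.projectDir = saved.projectDir;
    }
  }
}
```

Awaiting inside the try matters: returning the promise without `await` would run the `finally` block before the wrapped work had finished.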

[Figure: primitive stack diagram showing ManagedMap, CallbackRegistry, and withSessionIsolation as three small modules underneath the five feature modules.]

The primitives shipped before any feature module extraction. We migrated the 19 router Maps to ManagedMap one at a time, one Map per commit. Each migration was a two-line change. Test coverage stayed flat because the public API of each Map was unchanged — we replaced the implementation, not the contract. The CallbackRegistry rollout took longer (handlers moved one at a time over two weeks) and surfaced two real bugs along the way: a duplicate handler that had been silently overriding a callback for plan-mode resume, and a typo in a Telegram inline-button action name that had been logging "unknown action" warnings nobody noticed.

The seam choice for each module

With primitives in place, the actual decomposition was mechanical. Each feature module owned its concerns, declared its dependencies via a ServiceContext interface, and registered its handlers at startup. The router became a dispatcher and an orchestrator — it received a message, looked up which module handles it, and forwarded it. No more state, no more 32-case switch.

The crucial design decision in this phase was what counts as a module's public API. We adopted a strict rule: a module's public surface is whatever it registers with the dispatcher, plus its constructor signature. Nothing else. No public fields, no helper methods called from outside the module, no leakage of internal Maps. If the router needed to know something about an in-flight approval, it asked via a registered query handler — it did not reach into ApprovalModule.pendingRequests.

The rule produced one immediate cost we are still happy to pay: it forced us to explicitly model the cross-module communication that had been implicit in the monolith. Plan execution has to coordinate with the approval pipeline: when a plan needs approval, both modules need to know about it. In the monolith, this was a direct method call between two parts of the same class. In the modular version, it is an explicit requestApproval(planId, ...) query that the dispatcher routes from PlanModule to ApprovalModule. The new code is verbose. It is also auditable, which the old code was not.
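One way such a routed query could look is sketched below. Everything except the name requestApproval and the planId argument is an assumption: `QueryDispatcher`, `QueryCatalog`, the `summary` field, and the shape of the response are invented for illustration.

```typescript
// Hypothetical query catalogue. `summary` stands in for whatever the
// real requestApproval arguments are; only planId comes from the text.
interface QueryCatalog {
  requestApproval: {
    request: { planId: string; summary: string };
    response: { approvalId: string };
  };
}
type QueryName = keyof QueryCatalog;
type QueryHandler<Q extends QueryName> =
  (request: QueryCatalog[Q]['request']) => QueryCatalog[Q]['response'];

class QueryDispatcher {
  private readonly handlers = new Map<QueryName, unknown>();

  register<Q extends QueryName>(name: Q, handler: QueryHandler<Q>): void {
    if (this.handlers.has(name)) throw new Error(`query ${name} already registered`);
    this.handlers.set(name, handler);
  }

  query<Q extends QueryName>(
    name: Q,
    request: QueryCatalog[Q]['request'],
  ): QueryCatalog[Q]['response'] {
    const handler = this.handlers.get(name) as QueryHandler<Q> | undefined;
    if (!handler) throw new Error(`no module answers query ${name}`);
    return handler(request);
  }
}

class ApprovalModuleSketch {
  // Internal state stays private; other modules can only reach it via queries.
  private readonly pending = new Map<string, string>(); // approvalId -> planId

  constructor(dispatcher: QueryDispatcher) {
    dispatcher.register('requestApproval', (req) => {
      const approvalId = `approval-${req.planId}`;
      this.pending.set(approvalId, req.planId);
      return { approvalId };
    });
  }

  pendingCount(): number {
    return this.pending.size;
  }
}

class PlanModuleSketch {
  private readonly dispatcher: QueryDispatcher;

  constructor(dispatcher: QueryDispatcher) {
    this.dispatcher = dispatcher;
  }

  // No import of the approval module here: only the dispatcher is visible.
  executePlan(planId: string): string {
    const reply = this.dispatcher.query('requestApproval', { planId, summary: 'example plan' });
    return reply.approvalId;
  }
}
```

The verbosity is visible in the sketch, and so is the audit trail: every cross-module call has a name in the catalogue.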

Each module also declared its dependencies in the constructor as a typed ServiceContext rather than receiving 17 positional parameters. The context interface lists every external service the module touches — the Telegram client, the database, the persistent CLI bridge, the cost tracker. A new module declares which subset it needs. The router constructs a context with only the services it provides. The result: when we add a new module, the router does not need to grow a new constructor parameter. The 17-parameter constructor became a single ServiceContext injection.
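As an illustration of "declares which subset it needs", the sketch below uses TypeScript's `Pick` to narrow the context per module. This is only one possible encoding and differs from the non-generic ModuleBase in the excerpt that follows; the service interfaces, their methods, and `CostIntelModuleSketch` are all hypothetical.

```typescript
// Placeholder service interfaces with one hypothetical method each.
interface TelegramClient { sendMessage(chatId: string, text: string): void; }
interface Database { query(sql: string): unknown[]; }
interface CliBridge { send(prompt: string): void; }
interface CostTracker { totalUsd(): number; }

// One named field per external service, instead of 17 positional parameters.
interface ServiceContext {
  telegram: TelegramClient;
  db: Database;
  bridge: CliBridge;
  costs: CostTracker;
}

// Generic over the subset of services a module actually uses.
abstract class ModuleBase<K extends keyof ServiceContext> {
  protected readonly context: Pick<ServiceContext, K>;

  constructor(context: Pick<ServiceContext, K>) {
    this.context = context;
  }
}

// This module can see the Telegram client and the cost tracker, nothing else.
class CostIntelModuleSketch extends ModuleBase<'telegram' | 'costs'> {
  report(chatId: string): void {
    const spent = this.context.costs.totalUsd().toFixed(2);
    this.context.telegram.sendMessage(chatId, `Spend so far: $${spent}`);
  }
}
```

With this encoding, reaching for `this.context.db` inside the module is a compile error until the module widens its declared subset, which keeps the dependency list honest.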

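export class ApprovalModule extends ModuleBase {
  // TypeScript requires fields to be declared before the constructor assigns them.
  // PendingApproval and ApprovalBatch are placeholder names for the value types.
  private readonly pendingRequests: ManagedMap<PendingApproval>;
  private readonly batchQueue: ManagedMap<ApprovalBatch>;

  constructor(context: ServiceContext) {
    super(context);
    this.pendingRequests = new ManagedMap({ ttlMs: 360_000, maxSize: 100, name: 'approval-pending' });
    this.batchQueue   = new ManagedMap({ ttlMs:  60_000, maxSize:  50, name: 'approval-batch' });
    // ... 5 more managed maps
  }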
  registerHandlers(registry: CallbackRegistry): void {
    registry.register('approval', 'approve', this.handleApprove.bind(this));
    registry.register('approval', 'deny',    this.handleDeny.bind(this));
    // ... 5 more handlers
  }
}

The migration pattern that kept tests green

Every feature module extraction was a separate commit. Each commit moved one cluster's state and handlers out of the router and into a new module. The commits looked the same:

1. Create the new module file with the cluster's state migrated to ManagedMap.
2. Move the cluster's handlers into the module, register them via CallbackRegistry.
3. Delete the corresponding Maps and methods from the router.
4. Add an integration test that exercises one round-trip through the new module.
5. Run the existing test suite. Fix anything red.

The integration test in step 4 was the safety net. We had test coverage on the router as a whole, but the coverage was indirect: most tests exercised the router by sending a Telegram message and asserting the response. Those tests stayed green throughout because the public surface (Telegram in, Telegram out) was unchanged. The integration tests on the new modules were direct: they called the module's handlers with synthetic action payloads and asserted the resulting state changes. Direct tests caught two refactor bugs that the indirect tests missed: a handler that was registered to the wrong action type, and a closure that had captured a now-stale reference to the router.
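A direct test of this kind might look like the sketch below. It uses a stub registry and a cut-down module so that it is self-contained; `RegistryStub`, `ApprovalModuleSketch`, `decisionFor`, and the payload shape are assumptions, and a real suite would use its own test runner rather than a function that returns failure messages.

```typescript
type ApprovalPayload = { requestId: string };
type Handler = (payload: ApprovalPayload) => void;

// Minimal stand-in for the registry, just enough for a direct test.
class RegistryStub {
  private readonly handlers = new Map<string, Handler>();

  register(ns: string, action: string, handler: Handler): void {
    this.handlers.set(`${ns}:${action}`, handler);
  }

  dispatch(ns: string, action: string, payload: ApprovalPayload): boolean {
    const handler = this.handlers.get(`${ns}:${action}`);
    if (!handler) return false;
    handler(payload);
    return true;
  }
}

class ApprovalModuleSketch {
  private readonly decisions = new Map<string, 'approved' | 'denied'>();

  registerHandlers(registry: RegistryStub): void {
    registry.register('approval', 'approve', this.handleApprove.bind(this));
    registry.register('approval', 'deny', this.handleDeny.bind(this));
  }

  // Read-only view of state for assertions; the Map itself stays private.
  decisionFor(requestId: string): 'approved' | 'denied' | undefined {
    return this.decisions.get(requestId);
  }

  private handleApprove(payload: ApprovalPayload): void {
    this.decisions.set(payload.requestId, 'approved');
  }

  private handleDeny(payload: ApprovalPayload): void {
    this.decisions.set(payload.requestId, 'denied');
  }
}

// One round-trip per action: synthetic payload in, state change asserted.
// Returns a list of failure messages; an empty list means the test passed.
function runApprovalRoundTrip(): string[] {
  const failures: string[] = [];
  const registry = new RegistryStub();
  const approvals = new ApprovalModuleSketch();
  approvals.registerHandlers(registry);

  if (!registry.dispatch('approval', 'approve', { requestId: 'r1' })) {
    failures.push('approve is not registered');
  }
  if (approvals.decisionFor('r1') !== 'approved') {
    failures.push('approve did not record an approval');
  }
  if (!registry.dispatch('approval', 'deny', { requestId: 'r2' })) {
    failures.push('deny is not registered');
  }
  if (approvals.decisionFor('r2') !== 'denied') {
    failures.push('deny did not record a denial');
  }
  return failures;
}
```

A handler wired to the wrong action type fails the second or fourth check here, which a Telegram-in, Telegram-out test can miss if both paths produce a similar reply.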

The order of extraction mattered less than we expected. We started with ApprovalModule because it had the most state (seven Maps) and we wanted to flush out primitive issues early. After Approval, the rest came out in roughly the order we had clustered them in the original notes. By the third module, the pattern was rote — extract, register, delete from router, test, ship. By the fifth, we were doing it by lunch.

[Figure: before/after of the MessageRouter, showing the 19 Maps + 32-case switch + 17-param constructor on the left, and the dispatcher with 5 module imports + ServiceContext on the right.]

A note on the test suite that does not get said often enough: if the public surface of a refactor target is well-tested, the refactor itself does not need new tests for the existing behaviour. The thing that gets refactored is the implementation. The thing that needs new tests is whatever new shape the refactor exposes — in our case, the module APIs. Re-testing what is already tested through the public surface is wasted effort. Test the new seams.

The numbers we measured after

Six weeks after the framework rollout completed, we measured the cost of adding a feature again, using the same method as the baseline (median of the next ten features shipped after the measurement point).

The median feature time dropped to 1.5 hours from the baseline of four to eight. Files touched dropped to two to three from five to seven. Bugs surfaced in the next release dropped to two to four from ten to twenty-five. The MessageRouter ended up at ~800 lines from the original 4,506 — about an 82% reduction. The line count is the least interesting of these metrics; the cost-per-feature reduction is the metric that mattered to the business.

Three things changed in the per-feature workflow that account for most of the gain. First, the new module location is obvious from the feature description — an "Approval" feature lives in ApprovalModule, end of discussion. Second, the state lifecycle is handled by ManagedMap, which means a new feature does not write its own TTL/cleanup code. Third, the dispatch wiring is one line in registerHandlers instead of a new case in a 191-line switch with merge-conflict risk.

The merge-conflict reduction is its own story. In the monolith era, two engineers working on different features that both touched the router routinely produced merge conflicts in the switch statement, the constructor, or the Map declarations. Post-refactor, the same two engineers work on different files. Conflicts dropped from ~3 per week to ~0 per month. Pull-request review time dropped accordingly because reviewers no longer need to mentally diff a 4,000-line file with 200 lines of changes scattered across it.

When this kind of refactor is worth doing

The framework refactor cost 32–47 hours total, fully incremental, spread across multiple releases. The total cost is meaningful only against the cost of not doing it. Our baseline of 4–8 hours per feature with 10–25 bugs per release was already trending upward; another six months on the same trajectory would have made small features unaffordable.

The general rule we now apply: a structural refactor is worth doing when the trend in per-feature cost has visibly worsened over the last two quarters and you can name the structural cause. If per-feature cost is stable or improving, the refactor is premature optimisation. If you cannot name the structural cause — "I think the file is too big" doesn't count — the refactor will produce a differently-shaped problem. We had both: TAB-562 had given us a measured cost spike, and the read-through-and-cluster exercise had given us a named structural cause.

The corollary rule: do not start with the file split. Start with the read-through. The read-through is what tells you whether you have one concern stuffed into one file (refactor away) or five concerns stuffed into one file (refactor away in a different shape). The clusters that emerge from the read-through are the actual modules. The clusters that look obvious from the outside (alphabetical, line-count thresholds, "looks like X" hunches) are usually wrong.

The corollary's corollary: primitives before modules. The decomposition is structurally constrained by the data structures the modules will share. If you split before you have the shared primitives, every module will reinvent some version of TTL, dispatch, or session isolation. We avoided that by spending three days on three small files before any module extraction touched the router.

What the framework changed about how we think about features

A surprising second-order effect: features got smaller. Before the framework, a feature was an obviously expensive thing: five to seven files touched, ten to twenty-five bugs surfaced in the next release, four to eight hours of work. Engineers responded by batching features into larger releases. Three small features became one bigger feature because the per-feature overhead made the small ones uneconomic.

After the framework, the per-feature overhead dropped enough that small features became viable on their own. Our release pipeline shows the effect: from v2.3.66 onward, average release size (in lines changed) dropped by about 30% while release frequency went up by about 50%. The same engineering capacity now ships more, smaller, more focused changes. That has improved our bug-bisection time measurably — when something breaks in a release that contains three small focused changes, the bisection is one of three; when it contains one big mixed change, the bisection is the diff-of-the-diff.

The framework also shaped how we think about adopting upstream features from Claude Code. Phase 1 of the refactor was a HookAdapter layer: a single file that normalises Claude Code's evolving hook protocol (PreToolUse, PermissionRequest, Stop, PermissionDenied) into typed internal events. Before HookAdapter, every router branch that handled version-specific CLI behaviour had to know about Claude Code's wire format. After HookAdapter, the wire format lives in one file. When CLI v2.1.85 added the "if" field on hooks, we updated HookAdapter and shipped the downstream feature in two hours instead of two days.
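The normalisation idea can be sketched as a single function from an untyped payload to a discriminated union. The raw field names used below (`event`, `tool`, `input`, `reason`) are placeholders and should not be read as Claude Code's actual wire format; only the four event names come from the text above.

```typescript
// Typed internal events; the rest of the codebase would see only these.
type HookEvent =
  | { kind: 'preToolUse'; tool: string; input: unknown }
  | { kind: 'permissionRequest'; tool: string }
  | { kind: 'permissionDenied'; tool: string; reason: string }
  | { kind: 'stop' };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Placeholder field names: 'event', 'tool', 'input', 'reason' are assumed
// for illustration and are not the real hook payload schema.
function normalizeHook(raw: unknown): HookEvent | null {
  if (!isRecord(raw)) return null;
  const event = raw['event'];
  if (typeof event !== 'string') return null;
  const tool = typeof raw['tool'] === 'string' ? raw['tool'] : 'unknown';

  switch (event) {
    case 'PreToolUse':
      return { kind: 'preToolUse', tool, input: raw['input'] };
    case 'PermissionRequest':
      return { kind: 'permissionRequest', tool };
    case 'PermissionDenied': {
      const reason = typeof raw['reason'] === 'string' ? raw['reason'] : '';
      return { kind: 'permissionDenied', tool, reason };
    }
    case 'Stop':
      return { kind: 'stop' };
    default:
      // Unknown or newer event types are dropped here, in one place.
      return null;
  }
}
```

When the upstream payload gains a field, only this function and the `HookEvent` union need to change; consumers keep switching on `kind`.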

What we would tell another team facing a similar monolith

Two patterns generalise beyond CodePulse.

Measure your baseline before refactoring. "This file is too long" is not a baseline. "Median feature time is six hours and rising" is. Without a baseline, the refactor's value is impossible to defend, both internally and against your future self when you wonder six months later whether it was worth it. We have the v2.3.69 baseline measurement in our docs/RELEASES.md precisely so we can re-measure if the framework drifts.

Read the file before you split it. The clusters that emerge from a careful read-through are not the same clusters you would have predicted from the outside. Our five-module split would have looked obvious in hindsight; nobody on the team predicted it accurately before the read-through. The read-through is the work. The split is the easy part once the read-through is done.

The TAB-566 epic took 32–47 hours of engineering time to ship in full. It pays back the first time a feature ships in 1.5 hours instead of 6, and the savings compound from there. If your team is feeling the same per-feature drag we were — long planning sessions, files everyone is afraid to touch, merge conflicts on the dispatch switch — the same kind of structural intervention is probably overdue. The cost is bounded. The savings are not.
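For readers who want the cumulative break-even rather than the first saving, the arithmetic is short. The sketch below takes the 32–47 hour cost and the 1.5-versus-6-hour figures from the text and assumes every later feature saves the same amount, which is a simplification.

```typescript
// Number of features needed before cumulative savings cover the refactor cost.
// Assumes a constant saving per feature, which real projects will not match exactly.
function breakEvenFeatures(
  refactorHours: number,
  hoursBefore: number,
  hoursAfter: number,
): number {
  const savedPerFeature = hoursBefore - hoursAfter;
  if (savedPerFeature <= 0) return Infinity;
  return Math.ceil(refactorHours / savedPerFeature);
}

const lowEstimate = breakEvenFeatures(32, 6, 1.5);  // 32 / 4.5 = 7.11..., rounds up to 8
const highEstimate = breakEvenFeatures(47, 6, 1.5); // 47 / 4.5 = 10.44..., rounds up to 11
```

On those numbers the investment is recovered after roughly eight to eleven features, before counting the reduction in bugs and merge conflicts.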
