May 13, 2026
Producers Defer. Reviewers Don't.
Here is a pattern I have stopped being able to un-see.
A build agent finishes a PR. It reports back: tests pass, here is the diff, here are a few things I noticed along the way that we should probably handle later. The “we should probably handle later” list is always specific. There is a TODO about a helper that’s still inlined in three places. There is a note about an error path that returns the wrong status code on one branch. There is a comment about a missing tenant filter that “isn’t reachable from the new code path.” The agent files them as follow-up tickets in its summary.
I dispatch the code-reviewer.
The reviewer comes back with a verdict that overlaps about 70% with the build agent’s “for later” list, except it isn’t filed as a follow-up. It’s filed as fix this before merge. The helper should be extracted now. The error path should return the right status now. The tenant filter is one line and should be added now.
Then the kicker: half the time, the reviewer also adds two or three things the build agent never mentioned. It found a useEffect with a missing dependency that the implementer had stepped past. It noticed a test that asserts a side effect happened but doesn’t assert what the side effect was. It flagged a duplicated chunk of validation that’s already imported one level up.
The build agent watched all of this go by during the build. The reviewer is looking at the same diff. Why are they reaching different verdicts?
The pattern is the post
I have been running enough of these pipelines to call this a real pattern rather than a one-off observation. The build agent consistently wants to defer. The reviewer consistently wants to fix. The bias is not symmetric. The reviewer is not just a higher-resolution code reader — it is a structurally different reader.
A reviewer that catches things the implementer chose to defer is not surprising the second time you see it. The first time, you assume the implementer missed it. The second time, you assume the reviewer is being pedantic. The fifth time, you start asking what is actually going on.
What is going on is two effects, both of which a senior engineer will recognize from their own working memory, stacking in the same direction.
Effect one: context budget runs out
The build agent has been running for 30+ turns by the time it submits a diff. Its context window holds the spec, the brief, the file changes, the test runs, the failed attempts, the corrections, the reasoning chains. Some of that context is paged out and re-summarized. Some of it is just lossy.
Every additional decision the agent makes costs attention from a finite pool. The agent does not maintain perfect resolution on every detail of every hunk it produced thirty edits ago. By PR-submission time, the cheapest path through the loop is ship the thing, note the rough edges, file the follow-ups. That path consumes the least remaining attention. It is also the path that produces the most legible report: tests pass, here is the diff, here are the things to revisit.
A fresh-context reviewer is not running on a depleted attention pool. It is seeing the diff cold. The decision to extract a helper or fix an error path is not adjudicated against a budget that has been spent down for an hour. It is adjudicated against full budget on a small problem. The choice does not feel costly to the reviewer, because to the reviewer it isn’t.
This is a known property of long-running agent loops. It is the same dynamic that drives human engineers to declare features “done” on Friday afternoons that they would re-open on Monday morning. The budget matters.
Effect two: sunk cost in the producer
The other effect is harder to name, but I think it is real and possibly the larger of the two.
The build agent has 30+ turns of reasoning invested in the current shape of the diff. When it considers extracting a helper from three inlined call sites, what it is implicitly considering is unwind some of the decisions I made earlier, restructure, re-run the tests, re-validate the integration points. That is expensive to a loop that is trying to finish.
The reviewer has none of that investment. To the reviewer, the same change is add this small refactor. Same edit. Different cost-to-suggest.
This is a structural asymmetry, not a quality difference. The producing agent is not lazier than the reviewing agent. It is loaded with sunk cost. It correctly estimates that re-doing a hunk is more expensive for itself than for an outside reviewer. The reviewer is correct that the change is small. Both are seeing the same diff, weighing it from different positions, reaching different verdicts.
This is also the same dynamic that produces low-quality self-review in humans. You wrote it; you remember why it is shaped the way it is; you reconstruct intent and read past mistakes that an outside reader would not reconstruct. The agent version is the same, with the additional twist that the agent is also under context budget pressure that a human reviewer is not.
Why this makes independent review structural
The naive frame is that an independent reviewer is a “second opinion.” A check on the implementer’s work. A quality gate.
That frame is correct but undersized. The two effects above are not countered by quality; they are countered by position. No amount of self-discipline by the implementer eliminates them. The producer cannot un-spend its context budget. The producer cannot un-incur its sunk cost. The verdict the producer reaches on the defer-vs-fix-now question is structurally biased toward defer, by construction.
The reviewer is the structural fix. Not the second opinion. The producer-reviewer pair is the smallest unit that can issue an unbiased “fix now” verdict, because the unbiased perspective on that question requires a reader without the producer’s history.
This is the same shape as several other patterns I’ve written about. The chunking problem is the same insight one level down — boundaries that let an agent reason about a problem are the same boundaries that decide what the agent can or cannot hold in working memory. The spec is the ceiling is the same insight one level up — every step downstream of the spec is interchangeable, including the reviewer. The reviewer is interchangeable; the requirement that it exist is not.
The team that wires reviewers into the loop as optional is going to ship a pipeline that drifts toward defer. The team that wires reviewers in as mandatory will ship a pipeline that handles its small problems before they compound.
Multiple reviewers, in parallel, on the same diff
The cheapest move once you accept the structural framing is to dispatch several reviewers on the same diff, in parallel, with different briefs. Code-quality reviewer. Security reviewer if the diff touches auth or input handling. Library reviewer if the diff touches your AI scaffolding. Each one is fresh-context. Each one has full budget on a small problem. The diff has multiple unbiased readers.
The cost is small. Reviewers are read-only. They have no conflict edges. They run in a single dispatcher message with multiple agent calls. Wall-clock time for the wave is the slowest reviewer, not the sum of them.
The verdict the orchestrator takes is the intersection of approvals. Any reviewer rejecting the diff sends it back to a build agent. No reviewer is overruled by another reviewer — they are looking at different risk surfaces and they have different biases. Their verdicts compose by intersection, not by majority vote.
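To make the shape concrete, here is a minimal sketch in Python. Everything in it is hypothetical scaffolding: `run_reviewer`, the `Verdict` shape, and the brief strings are stand-ins for whatever your dispatcher actually exposes. The two structural properties are the ones that matter: the wave runs concurrently, and the verdict composes by intersection.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Verdict:
    reviewer: str
    approved: bool
    required_fixes: list[str]  # "fix before merge" items, not follow-ups


async def run_reviewer(name: str, brief: str, diff: str) -> Verdict:
    # Placeholder: wire this to your agent dispatcher. Stubbed so the
    # sketch runs as-is; a real reviewer is read-only and fresh-context.
    return Verdict(reviewer=name, approved=True, required_fixes=[])


async def review_wave(diff: str, briefs: dict[str, str]) -> tuple[bool, list[str]]:
    """Dispatch every reviewer on the same diff, concurrently.

    Wall-clock cost is the slowest reviewer, not the sum. The merge
    verdict is the intersection of approvals: one rejection is enough
    to send the diff back to a build agent.
    """
    verdicts = await asyncio.gather(
        *(run_reviewer(name, brief, diff) for name, brief in briefs.items())
    )
    approved = all(v.approved for v in verdicts)  # intersection, not majority
    fixes = [f for v in verdicts for f in v.required_fixes]
    return approved, fixes


briefs = {
    "code-quality": "Judge structure and duplication. Default to fix-now.",
    "security": "Judge auth, input handling, tenant isolation.",
}
approved, fixes = asyncio.run(review_wave(diff="<the diff>", briefs=briefs))
```

The design choice worth noticing is the `all(...)`: because verdicts compose by intersection, adding a reviewer can only tighten the gate, never loosen it.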
If this sounds like over-engineering for a single PR, it is. For a single PR you should let the engineer review it. For a pipeline of agent-produced PRs landing every hour, the only thing that scales is the structural fix. You will not personally review 40 diffs in an afternoon. The reviewer wave does.
The counter-argument that does not land
The counter-argument I hear most often: better models will fix the deferral bias. The producer will get smarter about not deferring.
I think this is wrong on the mechanism, not on the trajectory. The deferral bias does not come from the model being dumb. It comes from the producer being in a position where deferring is structurally cheaper than fixing. Better models will not change that. They will be more capable on the dumb-mistake axis, and they will still find deferring cheaper at submission time than fixing, because the cost asymmetry is real.
You can shrink the magnitude of the bias. You cannot eliminate it by making the model better at the producer’s job. The thing that eliminates it is moving the verdict to a different position — a fresh reader.
This is the same shape as the human is the next bottleneck. You can push the bottleneck. You cannot make it not exist. The thing under the bottleneck moves; the structure of the loop is the same.
What this means for your workflow
If you are running an agent-driven pipeline today, three things follow.
One: stop letting your producer agents file follow-up tickets unsupervised. The follow-up ticket is a deferral. A deferral by a context-spent agent with sunk cost is not the same artifact as a deferral by a human senior engineer who has held the whole problem in their head. Treat producer-filed follow-ups as suggestions to the reviewer, not as decisions.
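One way to mechanize that, under the same hypothetical scaffolding as the sketch above: fold the producer’s follow-up list into the reviewer’s brief, so each deferral gets an explicit ruling instead of silently becoming a ticket.

```python
def brief_with_followups(base_brief: str, producer_followups: list[str]) -> str:
    """Producer-filed follow-ups become reviewer input, not decisions.

    Each deferred item is surfaced as a question the reviewer must rule
    on explicitly: fix now, or defer with a reason.
    """
    if not producer_followups:
        return base_brief
    items = "\n".join(f"- {item}" for item in producer_followups)
    return (
        f"{base_brief}\n\n"
        "The implementer deferred the items below. Rule on each one:\n"
        "fix-now or defer, with a reason. Deferral is not the default.\n"
        f"{items}"
    )
```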
Two: make the reviewer mandatory on every non-trivial diff. Not optional. Not “when you remember.” Wired into the orchestrator loop, default-on, opt-out only for trivially safe changes (typos, comment fixes). The cost of running the reviewer wave on a small change is low. The cost of letting the deferral bias compound across a quarter is silent technical debt that nobody can point at because every individual decision was reasonable in isolation.
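A sketch of what default-on looks like at the orchestrator boundary. The threshold and suffix list are placeholders for your own definition of “trivially safe”; the structural point is that the gate defaults to running the wave, and the burden of proof sits on skipping it.

```python
TRIVIAL_SUFFIXES = (".md", ".txt")  # placeholder definition of "trivially safe"


def review_required(changed_paths: list[str], lines_changed: int) -> bool:
    """Default-on review gate: opt-out, never opt-in."""
    if lines_changed > 5:
        return True  # anything non-tiny gets the wave, no judgment call
    return not all(p.endswith(TRIVIAL_SUFFIXES) for p in changed_paths)
```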
Three: dispatch reviewers in parallel, with different briefs. One reviewer is a quality check. Three reviewers with non-overlapping briefs are a structural input to the orchestrator’s verdict. The marginal cost is small. The marginal coverage is large. The verdict the orchestrator takes is the intersection of approvals.
The team that does this systematically is going to ship code that, on a hunk-by-hunk basis, looks indistinguishable from code written by a senior engineer who never deferred. The team that doesn’t is going to ship code that accumulates a layer of “we should clean this up later” decisions that nobody re-reads. Both teams will be moving fast. Only one of them will look back in six months and see code they trust.
What I am still working out
A few things I do not have a clean answer for yet.
Is the bias the same across different model families? I suspect it is, because the mechanism is about loop position, not model behavior. But I have not run a head-to-head.
Does the bias scale with diff size? My informal answer is yes — larger diffs have more context spent and more sunk cost — but I do not have numbers.
What is the right brief shape for the reviewer to specifically not defer? My current brief for code-reviewer ends with a line that says default to recommending fixes inline rather than as follow-up tickets unless the fix exceeds a one-hour rewrite. That line has reduced producer-aligned deferral verdicts noticeably. I do not know if there is a better version of it.
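For concreteness, the closing of that brief looks roughly like this. The last sentence is the one quoted above; the rest is an illustrative reconstruction of the intent, not the exact wording I run.

```text
You are reviewing a diff you did not write. Weigh each issue by the cost
of the edit itself, not by the cost the implementer would incur to unwind
its earlier decisions. Default to recommending fixes inline rather than
as follow-up tickets unless the fix exceeds a one-hour rewrite.
```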
When I have answers, I will write them. For now: the producer defers, the reviewer doesn’t, and the team that wires that asymmetry into their pipeline ships a different quality of code than the team that doesn’t.
If you are running an agent pipeline and trying to figure out where the deferral debt is hiding in your workflow, book a call. I would rather you find it now than in a quarter.