Spec-Driven Development Works. The Hard Part Is Where You Cut It.

Two weeks ago I helped a client engineering team switch to Spec-Driven Development. The numbers from the first week: test coverage up 40%, roughly 80% of a new feature shipped from a single PRD with minimal engineer intervention.

That isn’t the story. The story is what showed up in week two: a bottleneck nobody on the team had a name for, sitting exactly where they’d just freed up the most capacity.

The bottleneck is figuring out where to cut the work. The bottleneck is chunking.

And one thing the receipts won’t tell you on their own: most of why this worked happened before week one.

Before

The team had the workflow most product engineering teams have. A ticket lands. An engineer picks it up. They build, they push, somebody reviews, it ships. Product and design renegotiate inside the build window — sometimes politely in PR comments, sometimes via a Slack DM that starts with “quick question”.

The engineers were absorbing the ambiguity. They always do. That absorption was the slow part of the cycle, but because it lived inside engineering’s slice of the process, the rest of the org filed it under “that’s just how long building takes”.

Standard story. I see this exact shape in most engagements.

What we’d already done

What I haven’t said yet is that this team had spent the previous quarter making the codebase ready for this. Not as part of an SDD initiative — they didn’t know SDD was coming. As regular cleanup work that any engineering team prioritizes when they have the runway: a refactor of three duplicated email service implementations into one, a consolidation of the auth flow down to a single canonical version, the removal of a layer of legacy middleware that nobody could explain anymore.

That was on top of a skill library they’d been building all year. Named, reusable patterns the team referred to by name in PRs and in standups. When a PRD said “validate this with the auth pattern,” there was exactly one auth pattern. The AI knew which one. So did the engineer reviewing the PR. There was no ambiguity to absorb.

The seams that let an AI hold a chunk in working memory are the same seams that let a human do it. SDD doesn’t create those seams. It rewards them. The architectural prep work isn’t ornamental — it’s the precondition that decides whether your week-one numbers look like ours.

The honest version of this story is that the team wasn’t fast at SDD because they switched workflows. They were fast because the code was ready and they switched workflows. Either alone underperforms.

What we changed

The change wasn’t dramatic in shape. We moved to PRDs first — a real PRD, not a Notion doc with three bullet points and a Loom — and routed the implementation through an AI coding workflow. Engineers stopped writing first-draft implementation. They reviewed PRDs, they reviewed AI-generated PRs, they integrated, they fought the ambiguous bits.

This change was cheap because the code was already in shape for it. If you’re starting from a tangled codebase, the workflow change alone won’t reproduce these numbers; read the previous section as the bill that came due first.

If you’ve read why I think AI adoption isn’t actually stalled, this is the team-level version of that argument. The org-level claim is that your AI pilot stalled because your decision pipeline didn’t change. The team-level claim is what happens when you do change it: the decisions move out of the code, the build phase shrinks, and a different problem appears in its place.

I’m staying tool-agnostic on what we used. The interesting thing isn’t which AI agent. It’s what changed about the loop.

The first week: receipts

Two numbers came out of the first week.

Test coverage went up 40%. Not because anyone made a coverage initiative. The PRDs specified behavior at a level of detail that made tests writeable as a side effect of implementation. The AI agent wrote the tests because the spec told it what the function should do. Engineers stopped deferring tests because there was nothing to defer — the test was part of the chunk, not a separate ticket.

Roughly 80% of a new feature shipped from a single PRD with minimal engineer intervention. Engineers were involved in the boundaries, the integration points, and a handful of decisions that hadn’t been pre-resolved in the spec. They weren’t writing the lines. They were directing.

I’d seen these kinds of numbers in demos and thought “sure, on a toy”. Watching them hold up against real product work changed my read.

The second week: where the workflow stalled

Then we hit week two.

The team kept the cadence going on small, well-shaped work. PRDs that targeted a single endpoint, a single migration, or a single component family flew. Even so, engineers were starting to push back on the spec-writing time, which felt slower than just opening the editor.

Then a feature came up that was bigger. A real feature. Cross-cutting, touching three subsystems, with a handful of policy decisions baked in. The team wrote one PRD for the whole thing.

The AI started strong, but as the implementation moved across subsystems it began contradicting earlier decisions. It re-litigated function signatures it had defined three steps back. It introduced patterns that didn’t match what existed elsewhere in the codebase. The engineer reviewing it spent the day correcting drift instead of integrating.

The team’s response was to split the PRD. They re-spec’d the feature as four separate chunks. That worked. Each chunk shipped clean. But the team had spent two days arguing over where to draw the lines. Somebody said it out loud in standup: “we shipped this in a day but spent two arguing over where to split the spec”.

That was the moment. The work compressed but the deciding didn’t.

The two failure modes

What we found, repeatedly, was a narrow band where SDD works and two failure modes outside it.

Chunk too big: the AI loses context partway through. It stops being able to hold the whole problem in working memory. Earlier decisions get re-decided, patterns drift, the integration cost on review eats whatever speed you gained.

Chunk too small: the spec-writing overhead exceeds the implementation cost. By the time an engineer has written a spec precise enough for the AI to act on, they could have just opened the file and made the change. The team ends up spending more time documenting intent than building.

The interesting thing is that the working band — the chunk size where SDD pays — is narrower than I expected and shifts depending on the problem. Cross-cutting changes have a smaller working band than localized ones. Greenfield code has a bigger band than touching legacy. Code with strong existing patterns has a bigger band than code without.
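A toy model makes the shape of that band easier to argue about. This is a sketch, not something we measured: every number below is invented, the cost names (spec_overhead, review_rate, context_limit, drift_rate) are hypothetical, and the assumption that drift grows linearly once a chunk passes some context limit is mine. The only point is that a fixed spec cost cuts off the small end and drift cuts off the large end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkCosts:
    """Hypothetical costs in engineer-hours. None of these are measured values."""
    spec_overhead: float   # fixed cost of writing one spec precise enough to act on
    manual_rate: float     # hours to hand-write one unit of work
    review_rate: float     # hours to review one unit of AI-written work
    context_limit: float   # chunk size past which drift starts (assumed)
    drift_rate: float      # correction hours per unit of work past the limit (assumed)


def sdd_cost(size: float, c: ChunkCosts) -> float:
    """Spec writing plus review, plus drift correction once the chunk is too big."""
    drift = max(0.0, size - c.context_limit) * c.drift_rate
    return c.spec_overhead + size * c.review_rate + drift


def manual_cost(size: float, c: ChunkCosts) -> float:
    """Cost of skipping the spec and just opening the editor."""
    return size * c.manual_rate


def working_band(c: ChunkCosts, sizes) -> list:
    """Chunk sizes where the spec-driven route is strictly cheaper than manual."""
    return [s for s in sizes if sdd_cost(s, c) < manual_cost(s, c)]


# Two invented profiles that differ only in how soon drift sets in.
localized = ChunkCosts(spec_overhead=2.0, manual_rate=1.0, review_rate=0.25,
                       context_limit=8.0, drift_rate=1.5)
cross_cutting = ChunkCosts(spec_overhead=2.0, manual_rate=1.0, review_rate=0.25,
                           context_limit=4.0, drift_rate=1.5)

print(working_band(localized, range(1, 21)))      # [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
print(working_band(cross_cutting, range(1, 21)))  # [3, 4, 5]
```

Lowering the context limit alone shrinks the band from eleven sizes to three, which is the cross-cutting effect in miniature. The model says nothing about where the limit actually sits for a given codebase; that is still the part we don't have a method for.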

This is the same insight that shows up one layer down, in the architecture itself: explicit boundaries are what AI needs to function. Chunking is the same problem at the workflow layer. A chunk is a boundary, and where you cut it decides whether the AI can reason about the whole problem.

There’s a third failure mode underneath both of these, and it doesn’t care about chunk size. If the codebase doesn’t have stable patterns, named conventions, or clean seams, no chunk works: you either drift inside a big chunk because there’s no anchor for the AI to reason from, or you burn cycles re-explaining context inside small ones. Chunking only matters once the code is ready to be chunked.

You might read this as “AI is doing 80% of the work, so we need fewer engineers, or we can hire less-senior ones to handle the easy 20%”. That isn’t the claim. The 20% didn’t get smaller. It moved upstream, and the stakes on each call went up. The senior engineer who used to spend 80% of their week implementing now spends that 80% deciding where to cut. They’re not less needed. They’re applied to a problem with a much bigger blast radius — get the chunk wrong and you lose a day of AI drift; get it right and a feature ships from a single spec.

What this actually means

Two weeks isn’t long. Here’s what we know and what we don’t.

What we know: SDD ships real product work. The 40% / 80% numbers happened on a real engagement, on real code, in week one. They held up in week two on the parts of the work where the chunks were sized right.

What we don’t know: a method for sizing chunks correctly on a feature you’ve never built before. We have hunches. Localized changes have a working chunk size somewhere around “one PRD per file family”. Cross-cutting features want to be split before you write the PRD, not after. Greenfield code rewards bigger chunks than legacy. None of these are a method. They’re patterns we keep noticing.
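Written down as code, two of those hunches look something like the sketch below. It is a way to make the hunches concrete enough to argue with, not a method. The Feature type, the subsystem and file-family names, and the "more than one subsystem means cross-cutting" rule are all assumptions for illustration. The greenfield-versus-legacy hunch is left out because we have no threshold to encode.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    name: str
    # Hypothetical inventory: subsystem -> file families the feature touches there.
    touches: dict = field(default_factory=dict)


def is_cross_cutting(feature: Feature) -> bool:
    """Assumed rule of thumb: touching more than one subsystem counts as cross-cutting."""
    return len(feature.touches) > 1


def propose_prds(feature: Feature) -> list:
    """First-cut split: one proposed PRD per file family, labeled by subsystem."""
    return [
        f"{feature.name}: {subsystem}/{family}"
        for subsystem, families in feature.touches.items()
        for family in families
    ]


def split_timing(feature: Feature) -> str:
    """Cross-cutting work gets split before any PRD is written, not after."""
    if is_cross_cutting(feature):
        return "split before writing any PRD"
    return "one PRD per file family is a reasonable first cut"


# Invented examples; none of these names come from the engagement.
localized = Feature("reset email copy", {"email": ["templates"]})
cross = Feature("team billing", {
    "auth": ["permissions"],
    "billing": ["invoices", "plans"],
    "email": ["templates"],
})

for prd in propose_prds(cross):
    print(prd)
print(split_timing(cross))
```

The output of propose_prds is a starting point for the argument about where to cut, not a replacement for it. The two days the team spent were about exactly the cases where file families and subsystems don't line up this neatly.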

There’s another misreading to head off, the one a CTO might leave with on a Monday after seeing the 40% / 80% numbers: “great, we’ll switch to PRD-first next sprint.” That isn’t the claim. You’re 80% of the way to those numbers if your code is already in shape; if it isn’t, the workflow change will surface every weakness in your codebase under load, faster than you can patch them. The team that figured out SDD wasn’t lucky in their tools. They had spent the previous quarter putting the codebase in a state where SDD could even run.

The lesson isn’t “do SDD”. The lesson is that SDD works the way good tooling always works. It eliminates the easy version of a problem and reveals the hard version underneath. The hard version of “build a feature” used to be “write the implementation”. It’s now “decide where the boundaries are”. That decision used to live inside engineering’s slice of the cycle, hidden inside the implementation work. Now it lives in front of the cycle, in the open, requiring an answer before any code gets written.

This is what your senior engineers should be doing. It’s also what most teams have spent the last five years not training engineers to do.

Where this leaves us

We’re still working on a chunking heuristic worth shipping as a method. We also don’t have a clean way to tell a team their codebase isn’t ready yet, beyond the obvious smells: no clear module boundaries, no shared patterns, lots of legacy cruft the team has stopped seeing. When we have either, I’ll write it. For now: SDD works on the chunks you size right, in code that’s ready to be chunked, and the team that figures out both faster will outpace the team that just writes more PRDs.

If you’re a CTO or VP of Engineering thinking about running SDD on a real team and wondering how to think about chunking from day one, book a call. I’d rather you hit week two without spending two days arguing over a spec.