
May 12, 2026

The Spec Is the Ceiling

ai · leadership · engineering-culture · spec-driven-development · agentic-workflows

Most engineering teams I talk to right now have everything they need to run a six-step agent-driven build pipeline. They have Claude Code or Cursor background agents. They have worktrees. They have a ticketing system with an API. They have review automation. The pieces are sitting on the desk.

Almost none of them have reorganized their work around what those pieces mean.

Here is the claim. Every step downstream of the stakeholder spec is now commoditized by AI agents. The agents are interchangeable, the ticketing platforms are interchangeable, the worktree mechanics are interchangeable. The only step that still requires human judgment, the kind shaped over time in conversations with stakeholders, is the spec itself. So the quality of the spec you hand the first agent is the only ceiling that determines whether agent-driven engineering ships good software or just ships entropy faster.

This is a 2026 problem. The teams that don’t fix their spec process this year will spend the AI productivity dividend building the wrong things, and they will call it a tooling problem when it is an intake problem.

The pipeline is real and runnable today

Let me describe the pipeline I am actually running, today, on real engagements. Not a vision deck. The shape that ships work.

flowchart TD
    A[Stakeholder Spec<br/>human-shaped, conversation-driven] --> B[AI Engineering Agent<br/>iterates the brief]
    B --> C[Technical Brief<br/>scoped, testable, bounded]
    C --> D[PM Agent<br/>cuts tickets in Linear]
    D --> E[Engineering Agent<br/>breakdown plus dependency DAG]
    E --> F[Wave of Build Agents<br/>parallel, isolated worktrees]
    F --> G[Wave of Review Agents<br/>code-quality, security, build]
    G --> H{Verdict}
    H -->|pass| I[Merge]
    H -->|fail| F

It starts with a stakeholder, usually a founder or product owner, showing up with a problem. Not a feature. A problem. The conversation surfaces what flows exist, what data each flow captures, what the system is for, what the failure modes are. This is human work, structured but not yet shaped into anything a coding agent could act on. The output is a stakeholder spec, a document that captures intent at the level the business actually cares about.

Next, an AI engineering agent iterates that spec into a technical brief. The agent asks the questions a senior engineer would ask. Where are the boundaries? What are the non-functional requirements? Which existing patterns does this slot into? The brief that comes out is testable, scoped, and bounded. The human stays in the loop on judgment calls. The agent does the mechanical work of pulling the brief into a structure other agents can act on.
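
For concreteness, step two can be driven from the same dispatcher used for the build wave further down. This is a sketch, not a fixed API: the agent name, prompt shape, and the briefs path are illustrative.

// Step two: one agent call that turns the stakeholder spec into a brief.
// Agent name, prompt shape, and the docs/briefs path are illustrative.

const brief = await Agent({
  subagent_type: "engineer",
  description: "Iterate stakeholder spec into technical brief",
  prompt: `
    Read docs/specs/checkout-receipt.md (the stakeholder spec).
    Ask the questions a senior engineer would ask: boundaries,
    non-functional requirements, existing patterns this slots into.
    Escalate judgment calls to the human orchestrator.
    Done when: a scoped, testable, bounded brief is written to
    docs/briefs/checkout-receipt.md.
  `,
});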

Then a PM agent reads the technical brief and cuts tickets into Linear, with acceptance criteria, dependency notes, and rough scope tags. The step that used to take a product manager half a day now takes a minute.
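
What matters about this step is the shape of what gets written, not the platform it gets written to. A minimal sketch of that shape, with hypothetical field names and the actual ticketing API call omitted:

// The ticket payload the PM agent cuts from the technical brief.
// Field names are hypothetical; the write itself goes through whatever
// ticketing API you run (Linear here, but interchangeable).

interface TicketDraft {
  title: string;
  acceptanceCriteria: string[];
  dependsOn: string[];    // tickets this one blocks on
  scope: "S" | "M" | "L"; // rough scope tag
}

const tickets: TicketDraft[] = [
  {
    title: "Receipt email - Flow 1, single item",
    acceptanceCriteria: [
      "Email arrives within 60 seconds of order confirmation",
      "HTML renders cleanly in Gmail, Outlook, Apple Mail",
      "Plain-text fallback present",
    ],
    dependsOn: [],
    scope: "S",
  },
  // ...one draft per chunk of the brief
];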

An engineering agent picks up from there and produces a breakdown plus a dependency DAG. It identifies which work can run in parallel, which work has conflict edges that require serial execution, and which tickets need to be split because they are still too coarse. The output is a wave plan.
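
The wave plan itself is mechanical once the dependency edges exist: everything whose dependencies are already satisfied goes into the current wave, then repeat. A minimal sketch, assuming tickets keyed by ID:

// Layer the dependency DAG into waves: wave N holds every ticket whose
// dependencies were all completed in waves 0..N-1.

function planWaves(deps: Map<string, string[]>): string[][] {
  const done = new Set<string>();
  const waves: string[][] = [];
  while (done.size < deps.size) {
    const wave = [...deps.keys()].filter(
      (id) => !done.has(id) && (deps.get(id) ?? []).every((d) => done.has(d)),
    );
    if (wave.length === 0) throw new Error("cycle in dependency DAG");
    wave.forEach((id) => done.add(id));
    waves.push(wave);
  }
  return waves;
}

// ENG-841 through ENG-843 have no conflict edges, so they form one wave.
planWaves(new Map([
  ["ENG-841", []],
  ["ENG-842", []],
  ["ENG-843", []],
])); // => [["ENG-841", "ENG-842", "ENG-843"]]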

The wave kicks off next, one build agent per ticket in the current wave, each in an isolated worktree, running in parallel. Each one drafts a PR against its assigned ticket. The orchestrator (still a human, but only barely) monitors progress and answers escalations.

Finally, a wave of review agents goes after the open PRs. Code quality, security, build hygiene, library hygiene. Each reviewer is read-only. Each runs on a single PR. The verdicts come back independently. PRs that pass move to merge. PRs that fail get kicked back to a build agent with the reviewer’s notes attached.

I run a version of this pipeline now. So do several of the teams I work with. The sidebar to this post (the wave-dispatched SDD engagement) walks through one of those runs with the actual numbers. The thing I want you to notice is that the pipeline is not aspirational. The tooling shipped. The worktree mechanics shipped. The agent dispatch shipped. The integration points are sharp.

Every step downstream of the spec is commoditized

Here is the part that should keep you up at night if you are a VP of Engineering.

The AI agents doing the software engineering work in steps two through six are interchangeable. Swap Claude Code for Cursor background agents and the pipeline still runs. Swap Linear for Jira and the pipeline still runs. Swap your worktree orchestration script for a different one and the pipeline still runs. The mechanics are commoditized. The technical brief is not load-bearing on which agent drafts it. The ticket cut is not load-bearing on which platform the PM agent writes to. The build wave is not load-bearing on which model the agents run.

This is the same observation that drove the post on clean architecture for LLMs: well-defined boundaries make components interchangeable. Boundaries that were ornamental five years ago are now the load-bearing part of the workflow. The reason the agents are interchangeable is that the integration points between them are clean enough that you can swap one without disturbing the others.

What is not interchangeable is step one. The stakeholder spec is shaped over time, in conversations the agents are not in, with people whose intent does not live in any model’s training data. The flows exist because someone decided they should. The data each flow captures exists because someone decided it should. The system is for something because someone in the room said so. None of that is downstream of compute. It is upstream of compute. The model can help shape the spec, but the model cannot generate the requirement the spec is shaping toward.

That asymmetry is the whole post. The agents are linear in cycles. Bad specs are exponential in rework. You can throw more agents at a bad spec and the only thing you accelerate is the rate at which the wrong thing gets built.

What a load-bearing spec actually looks like

Here is the shape of a spec that survives contact with the pipeline. Not the brief the agent writes in step two. The thing the human hands in at step one.

# Spec: Checkout Receipt Email

## What this is for
Customers who complete a purchase need a receipt they can forward to
accounting. Current state: they screenshot the order confirmation page.

## Flows in scope
1. Successful checkout, single item, single shipping address.
2. Successful checkout, multiple items, single shipping address.
3. Refund issued within 30 days of original purchase.

## Out of scope
Multi-address shipping. Subscription renewals. Gift orders.

## Data each flow captures
Flow 1+2: order ID, line items, totals, tax breakdown, shipping
address, customer email, fulfillment ETA.
Flow 3: original order ID, refund amount, refund reason, refunded
line items, partial vs. full flag.

## Non-functional
Email must arrive within 60 seconds of order confirmation. Receipt
HTML must render cleanly in Gmail, Outlook, Apple Mail. Plain-text
fallback required.

## Constraints we already know
We use the existing transactional email service. We do not introduce
a second one. Templates live in the email-templates package.

## What success looks like
A customer can forward the email to accounting and accounting can
file it without asking the customer for more information.

That is thirty-odd lines and it is enough for the rest of the pipeline to run. Notice what it does and does not contain. It does not specify the implementation. It does not pick the template engine. It does not write the SQL. It captures intent, flows, data, boundaries, and what done looks like. The agents in steps two through six fill in everything else.

This is also the spec shape that survives the chunking problem I wrote about in Spec-Driven Development Works. The Hard Part Is Where You Cut It. A spec at this level of granularity is small enough that the engineering agent can break it into chunks that fit in working memory, and detailed enough that the build agents do not drift.

Wave dispatch is the easy part

Here is what step five looks like in practice. A wave of build agents, kicked off in a single dispatcher message, each with their own worktree and their own brief.

// Wave 1: three independent tickets, no conflict edges.
// Dispatched in parallel from the orchestrator.

await Promise.all([
  Agent({
    subagent_type: "engineer",
    description: "ENG-841 receipt email Flow 1",
    isolation: "worktree",
    prompt: `
      Implement ticket ENG-841 (receipt email - Flow 1).
      Spec: docs/specs/checkout-receipt.md
      Files in scope: src/email/receipt/*, src/email/templates/receipt-*
      Done when: tests pass, PR opened against main, ticket linked.
    `,
  }),
  Agent({
    subagent_type: "engineer",
    description: "ENG-842 receipt email Flow 2",
    isolation: "worktree",
    prompt: `
      Implement ticket ENG-842 (receipt email - Flow 2).
      Same spec, multi-item path.
      Files in scope: src/email/receipt/multi-item.ts and templates.
      Done when: tests pass, PR opened, ticket linked.
    `,
  }),
  Agent({
    subagent_type: "engineer",
    description: "ENG-843 refund receipt Flow 3",
    isolation: "worktree",
    prompt: `
      Implement ticket ENG-843 (refund receipt - Flow 3).
      Same spec, refund path.
      Files in scope: src/email/receipt/refund.ts and templates.
      Done when: tests pass, PR opened, ticket linked.
    `,
  }),
]);

The interesting thing about this code is that none of it is interesting. The dispatcher is twenty lines. The worktree isolation is a flag. The briefs are pulled directly from the engineering agent’s breakdown in step four. There is no special tooling here. There is no secret sauce. The mechanics are commoditized.

What is doing the load-bearing work is the spec the engineering agent read in step four to produce those briefs. If the spec said “build a receipt system,” none of the briefs above would write themselves. The wave dispatch only looks easy because the spec did the work upstream.

The review wave is also commoditized

Step six, the review wave, is even more clearly commoditized. Two reviewer briefs:

// Wave 2: review the three open PRs in parallel.
// Each reviewer is read-only and points at a single PR diff.

await Promise.all([
  Agent({
    subagent_type: "code-reviewer",
    description: "Code review PR #1247",
    prompt: `
      Review PR #1247 (ticket ENG-841).
      Focus: pattern alignment, test coverage, edge cases.
      Cap response at 200 words. Verdict: approve, request-changes,
      or block.
    `,
  }),
  Agent({
    subagent_type: "security-reviewer",
    description: "Security review PR #1247",
    prompt: `
      Review PR #1247 (ticket ENG-841).
      Focus: PII handling in email body, template injection risk,
      auth on the trigger endpoint.
      Cap response at 200 words. Verdict only.
    `,
  }),
]);

This is a thirty-line file. The reviewers do not know about each other. They start with no context from the orchestrator’s session. Their independence is what makes their verdicts worth anything. And the brief they need is small because the PR is small because the spec was small because the spec was scoped.
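
The verdict-handling step downstream of those reviews is equally boring, which is the point. A minimal sketch of how the orchestrator might fold the independent verdicts into the merge-or-kick-back branch from the flowchart, with illustrative names:

// Fold independent reviewer verdicts into the merge / kick-back decision.
// Any "request-changes" or "block" sends the PR back to a build agent
// with the reviewer notes attached; unanimous approval merges.

type Verdict = "approve" | "request-changes" | "block";

interface Review {
  reviewer: string;
  verdict: Verdict;
  notes: string;
}

function decide(reviews: Review[]): { action: "merge" | "kick-back"; notes: string[] } {
  const failing = reviews.filter((r) => r.verdict !== "approve");
  return failing.length === 0
    ? { action: "merge", notes: [] }
    : { action: "kick-back", notes: failing.map((r) => `${r.reviewer}: ${r.notes}`) };
}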

If the spec is sprawling, the PR is sprawling, the review brief is sprawling, and the reviewer's verdict is mush. The pipeline degrades from the top.

The counterargument that does not land

The counterargument I hear most often is some version of: better models will eventually infer intent from a one-line ticket. Stop worrying about the spec, the model will figure it out.

I do not think this is right, but it is worth answering on its own terms. Even when models get better at inferring intent, the thing the business wants is not in the model. It lives in stakeholder conversations the model is not in. The model can infer intent from context it has been given. It cannot infer intent that exists only in the head of the founder who has not had the conversation yet.

The bottleneck moves. It does not disappear. I made the same point from a different angle in the post on the human bottleneck: every delegation chain terminates at a human. You can push the bottleneck up the stack, and you should, but the higher you push it, the more expensive each remaining decision becomes. Pushing the bottleneck up to “shape the spec” is the right move. Pretending the bottleneck went away is not.

The team-throughput version

I wrote a while back that the 10x dev is dead and the 100x engineer is the one who makes everyone around them better. The agent-driven version of that claim sharpens. The 100x engineer in 2026 is the one who can shape a spec that ten agents can run against without drifting. Not the one who can write ten agents’ worth of code in a week.

This is the upstream version of the chunking problem. There, the question was where to cut a feature so the build agents do not lose context. Here, the question is what to put in the spec so the cut points are visible at all. Same shape, one layer up. The work compressed. The deciding did not.

The case study sidebar to this post (wave-dispatched SDD engagement) gets concrete on what the wave throughput and rework numbers look like when the spec is sharp. I am not going to repeat the numbers here. The point of the case study is to show that the asymmetry is measurable. A good spec accelerates the wave. A bad spec multiplies the rework.

What to do on Monday

If you are running a 10 to 100 person engineering org, here is the question to walk in with on Monday: what does your spec process actually look like, before any agent or any ticket gets involved? Who writes the stakeholder spec? Who reviews it? How does it get from stakeholder intent to something an AI engineering agent can iterate on?

If the honest answer is “we write a Jira ticket and call it a spec,” you do not have a spec process. You have a ticketing process. Those are different.

The teams I am seeing pull ahead this year are not the ones with the fanciest agent stacks. They are the ones who treat the stakeholder spec as the load-bearing artifact it actually is. They have a named person who owns the spec shape. They have a review pass before anything goes to an agent. They have a feedback loop from the build wave back into the spec template, so next quarter’s specs are sharper than this quarter’s.

That is the only ceiling left worth working on. The agents will get cheaper, the tooling will get tighter, the integration mechanics will keep commoditizing. None of that decides whether your team ships good software in 2026. Your spec process does. The pipeline downstream of it is interchangeable. The spec is not.

If you are a CTO or VP of Engineering trying to figure out how to fix your intake before the agent dividend gets eaten by entropy, book a call. I would rather you fix the ceiling than spend another quarter shipping the wrong thing faster.