Managing AI Coding Agents: The Delivery Lead Guide

[Image: A delivery lead supervising multiple AI coding agents working in parallel.]
  • The Operator Shift: Managing AI coding agents isn't scaled code review. It's a shift from managing engineers who write code to operating a fleet of systems that do.
  • The New Bottleneck: Code generation is now effectively unlimited. Your team's true bottleneck is human review bandwidth and verification capacity.
  • System Governance: Carry context explicitly in files like AGENTS.md, and treat guidelines as versioned code rather than tribal knowledge.
  • Measuring Outcomes: Success is no longer measured in raw velocity, but in verified value safely shipped to production.
  • Scaling Constraints: Governance must be explicit—stand up your registries, test gates, and audit trails before trying to run parallel fleets.

Welcome to the definitive guide on managing AI coding agents. Your best engineers have quietly changed jobs.

They are no longer writing most of the code — they are dispatching autonomous agents that write it for them, in parallel, often overnight.

Yet most delivery organizations still manage this with habits built for human pull requests: one reviewer, one branch, one change at a time.

The result is the failure pattern now showing up across the industry — pilots that dazzle in a demo and then stall before they ever reach production. This guide gives PMO directors and engineering leaders the operating model to fix that.

The Operating Shift: From Writing Code to Managing Agents

For two decades, software delivery optimized one thing: the throughput of humans writing code. Tooling, ceremonies, and metrics all assumed a person at a keyboard.

Agentic coding breaks that assumption. An engineer can now describe a task, hand it to an agent, and walk away while it plans, edits, runs tests, and reports back.

The work still happens, but the human role moves up a level. Managing agents means directing, reviewing, and governing autonomous software rather than producing the output yourself.

You set the task, define what "done" looks like, scope what the agent may touch, and verify the result.

The clearest description comes from practitioners already living it: engineers increasingly run several agents at once and direct them the way a delivery lead oversees a team. The skill is no longer typing; it is orchestration and judgment.

Why this is not scaled code review

Treating agent management as "code review, but more of it" is the single most expensive mistake in this space. Human review assumes human-shaped mistakes.

Agents fail differently — confidently, at volume, and in ways a tired reviewer waves through. Manage the system, not the diff.

That means controlling what enters the pipeline (specs and guardrails) as deliberately as what exits it (review and gates).

Expert Insight: The teams that scale agents fastest treat them as managed production systems, not as smarter autocomplete. The defining trait of a mature operation is not the number of agents running but the operating model governing them. Agents are onboarded, versioned, and supervised, not improvised.

The Counter-Intuitive Truth: Why More Agents Make Most Teams Slower

Here is the misconception that quietly sinks programs: that adding agents adds output linearly. Ten agents, the logic goes, should ship roughly ten times the work.

The data says otherwise. Industry research from IDC and Forrester (replicated by a16z and the MIT Sloan CIO panel) converges on a stark figure: roughly 88% of AI agent pilots never reach production.

Other 2026 enterprise studies put the production-at-scale rate as low as 11–14%.

The failure is rarely the model alone. More often it is the layer around the model: evaluation, governance, integration, and ownership. Forrester's blocker breakdown is telling, with evaluation gaps (64%) and governance friction (57%) both ranking above model reliability (51%).

The new bottleneck is your review bandwidth

When agents generate code faster than humans can verify it, throughput collapses at the review step. Every additional agent without additional verification capacity adds risk, not speed.

This is the insight most "10x productivity" claims skip: the constraint has moved from writing code to trusting it.

The organizations that win don't run the most agents — they industrialize verification so review stops being the choke point.

PMO Warning: Counting pull requests or lines of code to prove agent ROI will mislead your steering committee. Agents inflate both effortlessly. Measure verified outcomes shipped to production per unit of human review, the metric that actually moves when the operating model is working.
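To make that metric concrete, here is a minimal sketch of how it could be computed. The field names (`prs_opened`, `verified_in_prod`, `review_hours`) and the sample numbers are illustrative assumptions, not figures from any of the studies cited above.

```python
from dataclasses import dataclass


@dataclass
class WeekStats:
    """One week of agent activity. All fields are hypothetical examples."""
    prs_opened: int          # raw volume, which agents inflate easily
    verified_in_prod: int    # changes that passed gates and shipped
    review_hours: float      # human hours spent verifying agent output


def verified_per_review_hour(stats: WeekStats) -> float:
    """Verified outcomes shipped per hour of human review."""
    if stats.review_hours <= 0:
        raise ValueError("review_hours must be positive")
    return stats.verified_in_prod / stats.review_hours


# Two made-up weeks: PR volume triples, verified output barely moves.
before = WeekStats(prs_opened=40, verified_in_prod=18, review_hours=30.0)
after = WeekStats(prs_opened=120, verified_in_prod=20, review_hours=50.0)

print(verified_per_review_hour(before))  # 0.6
print(verified_per_review_hour(after))   # 0.4
```

In this invented example a dashboard built on PR counts would show a 3x gain, while the review-normalized number falls. That gap is what a steering committee needs to see.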

Running Coding Agents in Parallel

Parallelism is where the promise — and the chaos — begins. The headline demo of the year made the ceiling vivid.

At Google I/O 2026, a fleet of 93 sub-agents working in parallel built the core of an operating system in under 12 hours for under $1,000 in compute.

Most teams don't need 93 agents. They need two or three that don't trip over each other. That requires isolation and disciplined task decomposition.

Isolation: give every agent its own lane

The practical pattern is one agent per isolated workspace — typically a separate git worktree or branch — so agents never write into the same files simultaneously.

Tooling has standardized on this; Google's Antigravity, for example, added native worktree and project support specifically to manage many agents at once.
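As a rough illustration of the one-agent-per-worktree pattern, the sketch below builds a `git worktree add` command for each agent. The `agent/<name>` branch naming, the sibling-directory layout, and the repository path are assumptions made for this example; platforms such as Antigravity manage this for you in their own way.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, agent: str, base: str = "main") -> list[str]:
    """Build the `git worktree add` command that gives one agent its own lane."""
    branch = f"agent/{agent}"                       # hypothetical branch naming
    workdir = repo.parent / f"{repo.name}-{agent}"  # sibling directory per agent
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", branch, str(workdir), base]


def create_worktree(repo: Path, agent: str, base: str = "main") -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_command(repo, agent, base), check=True)


# Print the commands instead of running them, so the sketch is safe to try.
for name in ("auth-refactor", "billing-tests"):
    print(" ".join(worktree_command(Path("/srv/repos/shop"), name)))
```

Each agent then works in its own directory on its own branch, and nothing merges until a gate or a reviewer says so.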

Decomposition: parallel work needs parallel-safe tasks

Agents parallelize well only when tasks are genuinely independent. Slice work so each agent owns a self-contained unit with clear boundaries.

Overlapping scope produces merge conflicts and "context drift," where agents diverge from each other's assumptions.
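One lightweight way to test whether a plan is parallel-safe is to compare the files each task expects to touch before dispatching anything. The sketch below assumes each task declares its scope as a set of file paths; the task names and paths are hypothetical, and a real check would probably handle directories or glob patterns as well.

```python
from itertools import combinations


def scope_conflicts(scopes: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return every pair of tasks whose declared file scopes intersect."""
    conflicts = []
    for a, b in combinations(sorted(scopes), 2):
        shared = scopes[a] & scopes[b]
        if shared:
            conflicts.append((a, b, shared))
    return conflicts


# Hypothetical plan: two tasks both expect to edit auth/tokens.py.
plan = {
    "task-auth": {"auth/login.py", "auth/tokens.py"},
    "task-billing": {"billing/invoice.py"},
    "task-session": {"auth/tokens.py", "auth/session.py"},
}

for first, second, files in scope_conflicts(plan):
    print(f"{first} and {second} overlap on {sorted(files)}")
```

An overlap is a signal to re-slice the work or run those two tasks in sequence rather than in parallel.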

For the full mechanics — worktree setup, conflict avoidance, and where parallelism stops paying off — see our deep dive on running parallel coding agents.

Pro Tip: Don't scale agent count to your compute budget; scale it to your review capacity. A useful rule of thumb: add a parallel agent only when you can answer "who verifies its output, and how fast?" If the honest answer is "me, eventually," you've found your real ceiling.
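The rule of thumb can be turned into back-of-the-envelope arithmetic. The sketch below assumes you can estimate three things: daily review hours available, changes each agent lands per day, and review time per change. The parameter names and numbers are illustrative, not benchmarks.

```python
import math


def max_parallel_agents(review_hours_per_day: float,
                        changes_per_agent_per_day: float,
                        review_hours_per_change: float) -> int:
    """Largest agent count whose daily output the team can still verify."""
    demand_per_agent = changes_per_agent_per_day * review_hours_per_change
    if demand_per_agent <= 0:
        raise ValueError("each agent must create some review demand")
    return math.floor(review_hours_per_day / demand_per_agent)


# Illustrative numbers only: 6 review hours a day, each agent lands
# 4 changes a day, each change takes half an hour to verify.
print(max_parallel_agents(6.0, 4.0, 0.5))  # 3
```

Automating more of the verification lowers the review time per change, which is the only lever in this formula that raises the ceiling without adding reviewers.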

The Asynchronous Agent Workflow (the Agent Smith Model)

The deeper shift behind the news is asynchrony, and the interest around Google's internal "Agent Smith" points to it.

Agent Smith is described as an agent that runs tasks in the background: you assign work and check on it from your phone, and it keeps working while you're offline.

Agent Smith builds on Google's Antigravity platform, whose Manager surface is purpose-built to spawn, orchestrate, and observe multiple agents working asynchronously.

Queue a task, walk away, get notified when it's done. Fire-and-verify replaces sit-and-watch.

Designing for "while you sleep" work

Synchronous coding kept a human in the loop continuously. Asynchronous agents invert that: you dispatch work, then verify it after the fact.

Your job becomes writing crisp tasks up front and reviewing evidence later — not babysitting a cursor. This is why platforms now emphasize verifiable artifacts.

Async only pays off if blocked agents fail safely and finished agents notify the right person. Build queues, escalation paths, and completion alerts into the workflow early.
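Those three requirements (a queue, an escalation path for blocked work, and a completion alert) can be sketched as a small state machine. Everything here is a hypothetical in-memory stand-in: the `Dispatcher` class, the state names, and the people in the example are assumptions, and a real system would sit on an actual queue and notification service.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    BLOCKED = "blocked"
    DONE = "done"


@dataclass
class Task:
    name: str
    owner: str            # person notified on completion
    escalation: str       # person contacted if the agent gets stuck
    state: State = State.QUEUED


@dataclass
class Dispatcher:
    """In-memory stand-in for a real queue plus notification service."""
    outbox: list[tuple[str, str]] = field(default_factory=list)

    def update(self, task: Task, new_state: State) -> None:
        task.state = new_state
        if new_state is State.BLOCKED:
            # Fail safe: stop work and hand the decision to a human.
            self.outbox.append((task.escalation, f"{task.name} is blocked"))
        elif new_state is State.DONE:
            self.outbox.append((task.owner, f"{task.name} is ready for review"))


dispatcher = Dispatcher()
migrate = Task("migrate-config", owner="dana", escalation="on-call-lead")
dispatcher.update(migrate, State.RUNNING)
dispatcher.update(migrate, State.BLOCKED)   # escalates to on-call-lead
dispatcher.update(migrate, State.RUNNING)
dispatcher.update(migrate, State.DONE)      # notifies dana
print(dispatcher.outbox)
```

The design choice worth copying is that every task names both an owner and an escalation contact at creation time, so no overnight job can finish or stall without a specific person hearing about it.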

We break down the full async pattern — queuing, blocking behavior, and post-hoc review — in our guide to asynchronous AI agent workflows.

Reviewing Code the Agent Wrote

When agents produce code at volume, the review process designed for occasional human PRs breaks.

Reviewer fatigue is not a soft concern here — it is the failure mode that lets a confident, wrong change reach production.

Agent-specific failure modes

Agents introduce risks human reviewers aren't primed to catch: plausible-looking code that uses outdated API conventions, silent security gaps, and subtly wrong assumptions stated with total confidence.

Notably, 70% of leaders name non-deterministic output as the top production-readiness barrier.

Gates, not glances

The fix is to stop relying on a human reading every diff and start building automated gates: required tests, security scans, and policy checks.

The agent's code must pass these before a human even looks. Reserve human attention for judgment, not mechanical checking.

Our dedicated walkthrough covers a defensible AI agent code review process — including which checks to automate and what a human must always verify.

Compliance Note: Agent-written code can act with real credentials on real customer data. Security reviews must explicitly cover agent permissions, tool-access scope, and data handling; 62% of practitioners already rank security as a top deployment challenge. Document who approved each high-stakes change, because "the agent did it" is not an audit answer.

AGENTS.md — The Operating Manual Your Fleet Reads

Every coding agent starts each session blind to your project's specifics. It knows Python in general but not that your team uses a particular toolchain.

The industry's answer is AGENTS.md — an open standard launched jointly by Google, OpenAI, Cursor, Sourcegraph, and Factory.

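It is now read by tools including Claude Code, Cursor, Copilot, Codex, and Gemini, though how each one loads the file varies by tool and version, so check the documentation for the agents in your stack. Think of it as a README written for agents instead of humans.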

What belongs in it

The high-signal sections are build and test commands, code-style conventions, non-negotiable constraints, and domain gotchas.

Together these capture the tribal knowledge senior engineers carry in their heads. Place commands early, because agents reference them repeatedly.

The counter-intuitive part: more context can hurt

Research by Gloaguen et al. (2026) found that bloated, machine-generated context files actually reduced agent task success while raising inference cost by over 20%.

A tight, human-authored file of genuine constraints beats an exhaustive auto-generated one. Treat AGENTS.md like code — concise, reviewed in pull requests, and version-controlled.
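If AGENTS.md is treated like code, it can be linted like code. The sketch below is one possible pull-request check, under stated assumptions: the required section names and the 150-line budget are arbitrary choices for illustration, not part of the AGENTS.md standard, and `make test` in the sample is a placeholder command.

```python
import re

REQUIRED_SECTIONS = ("Build and test", "Code style", "Constraints")  # hypothetical
MAX_LINES = 150  # arbitrary budget; tune to your repo


def lint_agents_md(text: str) -> list[str]:
    """Return a list of problems; empty means the file passes this sketch's checks."""
    problems = []
    lines = text.splitlines()
    if len(lines) > MAX_LINES:
        problems.append(f"too long: {len(lines)} lines (budget {MAX_LINES})")
    # Collect markdown heading text, lowercased for comparison.
    matches = (re.match(r"#+\s+(.*)", line) for line in lines)
    headings = [m.group(1).strip().lower() for m in matches if m]
    for section in REQUIRED_SECTIONS:
        if section.lower() not in headings:
            problems.append(f"missing section: {section}")
    return problems


sample = """# AGENTS.md
## Build and test
- make test
## Code style
- Format with the project formatter before committing.
"""
print(lint_agents_md(sample))  # ['missing section: Constraints']
```

A check like this keeps the file short and complete, but it cannot judge whether the constraints are the right ones. That still takes a human reviewer.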

Our practical template and section-by-section breakdown lives in the AGENTS.md file guide.

Human-in-the-Loop: Where to Place the Gates

Human-in-the-loop is the dial between speed and safety — and most teams set it wrong in one of two directions.

Too many approval gates and the agent stalls, waiting on humans for trivial changes; you've rebuilt the bottleneck you were trying to remove.

Too few and an autonomous agent ships a disaster with real consequences. You must match the gate to the stakes.

The workable pattern is graduated autonomy: let agents act freely on low-risk, easily reversible work, and require explicit human sign-off only for high-stakes actions.
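Graduated autonomy can be written down as a small policy function. In the sketch below, the `Action` fields, the contents of `ALWAYS_GATED`, and the approver names are all hypothetical; the point is only that the rule is explicit and that every gated action resolves to a named person.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool     # can it be undone quickly, e.g. by reverting a commit?
    touches_prod: bool   # does it affect production systems or customer data?


# Hypothetical list of actions that always need a named human.
ALWAYS_GATED = {"rotate-credentials", "drop-table", "change-billing-logic"}


def required_approver(action: Action, approvers: dict[str, str]) -> Optional[str]:
    """Return the named approver for gated actions, or None if the agent may proceed."""
    gated = (action.name in ALWAYS_GATED
             or action.touches_prod
             or not action.reversible)
    if not gated:
        return None
    # Fall back to a default owner so no gated action is ever unowned.
    return approvers.get(action.name, approvers["default"])


approvers = {"default": "delivery-lead", "drop-table": "dba-on-call"}
print(required_approver(Action("update-docs", True, False), approvers))   # None
print(required_approver(Action("drop-table", False, True), approvers))    # dba-on-call
print(required_approver(Action("deploy-hotfix", True, True), approvers))  # delivery-lead
```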

Deciding where each gate sits — and how to avoid approval bottlenecks — is covered in detail in our piece on human-in-the-loop approval gates.

Pro Tip: Write your gate policy as a one-page list of "actions an agent may never take without a named human approver." If that list is empty, you have no governance. If it's the whole backlog, you have no agents. The right answer is short, specific, and reviewed quarterly.

Governing the Fleet: Agent Registries & Coordinated Rollouts

At small scale, ad-hoc agents are fine. At fleet scale, the absence of a control layer is exactly what kills programs.

IBM projects enterprises will soon run on the order of 1,600 agents each, and roughly 70% of them admit they can't govern the ones they already have. Only about one in five has a mature governance model for autonomous agents.

An agent registry is the fleet's source of truth. It tracks which agents exist, what each is permitted to do, and which version is running where. It turns a swarm of improvised scripts into a governed system.

Treat agents like deployable software. Version agents, roll out updates deliberately, and keep the ability to roll back a bad version instantly.
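At its simplest, a registry is a record per agent with its permissions and version history, plus rollout and rollback operations. The sketch below is an in-memory illustration only: the class names, the permission strings, and the version numbers are assumptions, and a production registry would persist its data and log every change.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    name: str
    permissions: frozenset[str]
    versions: list[str] = field(default_factory=list)  # history, newest last

    @property
    def current(self) -> str:
        return self.versions[-1]


class Registry:
    """In-memory sketch; a real registry would persist and audit every change."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def register(self, name: str, version: str, permissions: set[str]) -> None:
        self._agents[name] = AgentRecord(name, frozenset(permissions), [version])

    def roll_out(self, name: str, version: str) -> None:
        self._agents[name].versions.append(version)

    def roll_back(self, name: str) -> str:
        record = self._agents[name]
        if len(record.versions) < 2:
            raise RuntimeError(f"{name} has no earlier version to roll back to")
        record.versions.pop()
        return record.current

    def may(self, name: str, permission: str) -> bool:
        # Unknown agents and unlisted permissions are both denied.
        record = self._agents.get(name)
        return record is not None and permission in record.permissions


registry = Registry()
registry.register("test-writer", "1.0.0", {"read:repo", "write:tests"})
registry.roll_out("test-writer", "1.1.0")
print(registry.roll_back("test-writer"))                 # 1.0.0
print(registry.may("test-writer", "write:prod-config"))  # False
```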

The mechanics of versioning, rollback, and coordinated rollouts are in our guide to the AI agent registry and rollouts.

PMO Warning: Gartner estimates over 40% of agentic AI projects are at risk of cancellation by 2027, and the common thread among the casualties is governance built as an afterthought. Stand up the registry, gates, and audit trails before you scale agent count.

The Skills Shift for Delivery & Engineering Leaders

If the unit of work has changed, so has the job. The skills that defined a strong engineering manager in 2024 don't disappear, but they stop being the differentiators.

The high-leverage skills now are task decomposition, specification writing, verification design, and governance.

The leader who can define crisp success criteria and industrialize review outperforms the one who can still out-code everyone.

For capacity planning with AI, output per human-review-hour is a more honest measure than story points.

We map the full transition — including how to coach engineers who now run agents — in our guide to the engineering manager AI skills shift.

Putting It Together: A 90-Day Operating Model

You don't roll all of this out at once. The sequence that works mirrors the way the data says winners scale: governance first, agent count second.

Days 1–30 — Establish the rails. Write a tight AGENTS.md, define your gate policy, and scope permissions. Pick one well-bounded task and run a single agent.

Days 31–60 — Industrialize verification. Automate the checks an agent must pass before human review. Stand up a lightweight registry.

Days 61–90 — Scale and govern. Introduce async workflows for overnight work, add audit trails, and formalize change management for model upgrades.

The complementary product-leadership perspective — managing agents as synthetic team members inside the product org — is covered in our companion pillar on the rise of agentic product management.

About the Author: Sanjay Saini

Sanjay Saini is a Senior Product Management Leader specializing in AI-driven product strategy, agile workflows, and scaling enterprise platforms. He covers high-stakes news at the intersection of product innovation, user-centric design, and go-to-market execution.


Frequently Asked Questions (FAQ)

What does it mean to manage AI coding agents?

Managing AI coding agents means supervising autonomous software that plans, writes, and tests code — directing, reviewing, and governing its output rather than typing it yourself. The role shifts from individual contributor to operator: setting tasks, defining guardrails, and verifying results across a fleet.

How is managing AI coding agents different from managing developers?

Agents work in parallel, asynchronously, and without instinct for scope or risk. Unlike developers, they don't retain your context between sessions, so guidance lives in files like AGENTS.md. You manage throughput and verification capacity, not careers — and your review bandwidth becomes the real constraint.

How many coding agents can one engineer supervise at once?

There's no fixed number; it depends on task complexity, review capacity, and how much verification you've automated. Teams typically start with two or three and scale as guardrails mature. The ceiling is human review bandwidth, not compute — past it, quality drops fast.

What skills do delivery leads need to manage AI coding agents?

Task decomposition, specification writing, verification design, and governance — plus the judgment to know when to trust output and when to gate it. The ability to define success criteria and review at scale matters more than being the fastest coder in the room.

Do AI coding agents replace software engineers?

Not cleanly, and not yet. Agents handle well-scoped, verifiable tasks but still need humans for architecture, judgment, and review. The role is changing more than disappearing: engineers increasingly direct agents like a delivery lead oversees a team, shifting effort from writing to supervising.

How do you measure the productivity of AI coding agents?

Measure shipped, verified outcomes — not lines of code or PR volume, which agents inflate easily. Track cycle time, change-failure rate, rework, and review throughput. The honest metric is value delivered to production per unit of human verification, since review is now the bottleneck.

What is the "agent manager" operating model?

It's running coding agents as a managed production capability: defined tasks, versioned instructions, approval gates, observability, and a registry of which agents do what. Borrowed from delivery management, it treats agents as team members needing onboarding and supervision rather than one-off tools.

How do you onboard a new AI coding agent to a codebase?

Give it a concise AGENTS.md with build and test commands, conventions, and do-not-touch rules; scope its permissions; and start with a small, verifiable task. Run it through explore-plan-code-verify, review the evidence, then widen autonomy as it proves reliable — much like onboarding a new hire.

What are the biggest risks of letting agents write production code?

Silent security flaws, non-deterministic behavior, and reviewer fatigue that lets bad changes slip through. Without permission scoping and approval gates, an agent can act with real credentials on real data. Most pilots fail here — governance, not model quality, is usually the deciding factor.

How do you set up governance for a fleet of coding agents?

Start with an agent registry (who does what, on which version), permission scoping, approval gates for high-stakes actions, and audit trails. Add change management for model upgrades and clear ownership. Only about one in five organizations has this maturity today — which is why most stall.