Why most AI pilots never reach production.
Companies are spending heavily on AI and getting demos in return. The gap between a pilot that impresses and a system that runs the business is the most expensive problem in enterprise AI right now — and it is almost never the problem people assume it is.
Most companies now have an AI pilot. Far fewer have an AI system running in production. Almost none can point to one that touches a real workflow, takes real actions, and has changed how the business operates.
The instinct is to blame the model. Not capable enough, not reliable enough, not ready yet. But when researchers go and look at what separates the projects that scale from the ones that quietly die, the model is not where the answer lives.
The pilots are not failing because the AI cannot think. They are failing because nothing safely connects what it produces to what the business actually does.

The number everyone quotes
The most cited evidence here comes from MIT's NANDA initiative, whose 2025 report The GenAI Divide: State of AI in Business concluded that roughly 95% of corporate generative-AI pilots delivered no measurable impact on profit and loss, while only about 5% produced rapid revenue gains. The study drew on around 150 leader interviews, a survey of 350 employees, and an analysis of 300 public deployments, set against an estimated $30–40 billion in enterprise AI spending [1].
That figure is worth treating carefully. It went viral, and it has been challenged: some analysts argue it measured a narrow definition of return and that the headline overstates outright failure. The exact percentage is debatable. But even the skeptics tend to agree on the shape of the thing: most pilots get stuck somewhere between an impressive demo and a system the business depends on. Gartner has projected, from a separate dataset, that more than 40% of agentic-AI projects will be canceled by the end of 2027. Two different methods, the same conclusion.
What makes the MIT work useful is not the number. It is the diagnosis. According to its lead author, the failures were less about the quality of the underlying models and more about how organizations tried to use them. Generic tools impress in a demo and then stall inside a company, because they do not adapt to that company's actual workflow. The efforts that succeeded did the opposite: they picked one real problem, executed it well, and partnered with someone who could deploy it properly.
Pilots operate the model in a safe environment. Production requires the layer beneath it — the one most pilots never build.
The same gap, seen from the inside
We have written before about the shift from AI that thinks to AI that acts, and the layer that shift exposes. The pilot-stall problem is that same gap, viewed from the other end.
On one side sit capable models that can reason and generate actions. On the other sit real business systems — CRMs, inboxes, databases, published channels — that cannot afford mistakes. Between them, an entire layer is usually missing: approvals, permissions, workflow context, audit trails, execution control. A pilot is a pilot precisely because it stops at the model layer. It produces output in a safe environment where nothing it generates has consequences.
The moment you ask it to act inside real systems, the question changes from can the model answer correctly to should this action be allowed, and who is accountable when it happens. That is the wall most pilots hit. Not a reasoning wall. A control wall. And no amount of additional model capability gets you over it, because the thing that is missing was never intelligence in the first place.
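A minimal sketch of what that control question can look like in code, assuming a simple deny-by-default policy table. The agent name, tool names, and owner label below are hypothetical placeholders, not a description of any particular product.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy: which tools each agent may call, and which named
# owner is accountable for that agent's actions.
POLICY = {
    "support-agent": {
        "allowed_tools": {"crm.read", "email.draft"},
        "owner": "support-lead",
    },
}


@dataclass
class Decision:
    allowed: bool
    accountable: Optional[str]
    reason: str


def authorize(agent: str, tool: str) -> Decision:
    """Deny by default: unknown agents and unlisted tools are both refused."""
    entry = POLICY.get(agent)
    if entry is None:
        return Decision(False, None, "unknown agent")
    if tool not in entry["allowed_tools"]:
        return Decision(False, entry["owner"], "tool not permitted")
    return Decision(True, entry["owner"], "permitted")
```

Note that nothing in this check depends on how capable the model is. It answers only whether the action is allowed and who owns the outcome.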
What the model builders themselves recommend
There is a quiet tell in how the labs talk about this. The two organizations with the most incentive to make agents sound effortless — Anthropic and OpenAI — both publish guidance that is notably restrained, and they largely agree.
Anthropic's engineering guidance draws a clean line between workflows, where models move through predefined paths, and agents, where the model directs its own steps. Its core recommendation is to use the simplest approach that works and add autonomy only when simpler patterns genuinely fall short, because autonomy trades predictability and cost for capability. Agents earn their keep, it argues, on tasks with clear success criteria, real feedback loops, and meaningful human oversight in place [2].
OpenAI's guidance lands in the same place. It frames agents as the right tool for workflows where rigid automation has historically failed, and is explicit that you should not begin with full autonomy. Start narrow. Keep humans in the loop. Add guardrails. Route high-risk or irreversible actions (payments, deletions, anything you cannot take back) through a human approval step before the agent proceeds [3].
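As a rough illustration of that routing rule, the sketch below sends anything on an irreversible-action list to human approval and lets everything else proceed. The action names and the two route labels are assumptions made for the example; neither lab's guidance prescribes this exact structure.

```python
from dataclasses import dataclass, field

# Hypothetical list of action kinds treated as high-risk or irreversible.
# A real deployment would define its own.
IRREVERSIBLE_ACTIONS = {"payment", "delete_record", "send_external_email"}


@dataclass
class ProposedAction:
    kind: str
    payload: dict = field(default_factory=dict)


def route(action: ProposedAction) -> str:
    """Return 'needs_approval' for irreversible actions, else 'auto_execute'."""
    if action.kind in IRREVERSIBLE_ACTIONS:
        return "needs_approval"
    return "auto_execute"
```

Starting narrow then amounts to keeping the irreversible list long at first and shortening it only as the agent earns trust.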
Strip the branding from both and you get one specification:
- One job, done well — not a general-purpose everything-bot.
- Grounded in the real workflow — the agent has to know your steps, your tools, your rules.
- A human gate on consequential actions — autonomy is earned, not assumed.
- A feedback loop — the system should get sharper as it learns what you accept and reject.
The four properties Anthropic and OpenAI converge on. Together, they describe the layer underneath the model — not a smarter model.
Where DeployCo fits
This is the space DeployCo is built for. Not a better model, and not another wrapper around one — the deployment and governance layer that turns a stalled pilot into a system a business can actually run.
The shape of it follows directly from the evidence. Each engagement is built for one job rather than configured from a template. The agent is designed against your real workflow, with its goals, its tools, and the boundaries of what it must not touch all defined at the schema level. It runs on our infrastructure, not yours. And every output it produces lands in your approval queue, with the context behind it, before anything reaches a customer. Nothing bypasses that gate, and every decision (what the agent did, what your team approved or rejected, what we changed) is recorded in an append-only audit log.
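To make the gate-plus-log idea concrete, here is a simplified, in-memory sketch of an approval queue backed by an append-only log. It is an illustration under stated assumptions, not DeployCo's implementation: the class names, method names, and event labels are all hypothetical, and a production log would live in durable storage rather than a Python list.

```python
import json
from datetime import datetime, timezone


class AuditLog:
    """Append-only by construction: offers append and read, no update or delete."""

    def __init__(self):
        self._entries = []

    def append(self, event: str, **details) -> None:
        record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event}
        record.update(details)
        self._entries.append(json.dumps(record, sort_keys=True))

    def entries(self) -> tuple:
        # Return an immutable copy so callers cannot edit history.
        return tuple(self._entries)


class ApprovalQueue:
    """Holds agent outputs until a human approves or rejects them."""

    def __init__(self, log: AuditLog):
        self._log = log
        self._pending = {}
        self._next_id = 1

    def submit(self, output: str, context: str) -> int:
        item_id = self._next_id
        self._next_id += 1
        self._pending[item_id] = (output, context)
        self._log.append("submitted", item_id=item_id, context=context)
        return item_id

    def decide(self, item_id: int, reviewer: str, approved: bool):
        """Release the output only on approval; log the decision either way."""
        output, _context = self._pending.pop(item_id)
        event = "approved" if approved else "rejected"
        self._log.append(event, item_id=item_id, reviewer=reviewer)
        return output if approved else None
```

The only path from `submit` to a released output runs through `decide`, and both steps write to the log, so the record of what was proposed and who signed off accumulates as a side effect of normal use.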
The point is not autonomy for its own sake. It is an agent narrow enough, grounded enough, and governed enough to survive contact with a real business — which, per the research, is the only kind that makes it out of the pilot stage.
From pilots to production
This transition will not announce itself. There will be no single moment when everything changes. It will simply become obvious, over the next couple of years, that some companies stopped experimenting with AI and started operating it — and that the difference between the two had little to do with which models they used.
The market will keep growing regardless; estimates for the size of the agentic-AI market by the early 2030s range widely, from roughly $47 billion to well over $130 billion depending on whose forecast you read [4]. The growth is not in question. What is in question, for any individual business, is which side of the divide it lands on.
The winners will not be the ones who had the best models first. They will be the ones who figured out deployment, control, and governance early — the layer between what AI can do and what a business can safely let it do.
Citations & further reading
[1] MIT NANDA initiative, The GenAI Divide: State of AI in Business 2025 (see coverage of the report). The signal: pilots stall on integration and learning, not model quality.
[2] Anthropic, Building Effective AI Agents. Use the simplest pattern that works; reserve autonomy for tasks with clear criteria, feedback loops, and human oversight.
[3] OpenAI, A practical guide to building agents. Do not start with full autonomy; keep a human gate on high-risk, irreversible actions.
[4] Agentic-AI market-size estimates vary by methodology; see for example Fortune Business Insights and Mordor Intelligence. Treated here as a range, not a single forecast.