Field Notes · Agentic QA Strategy

Pilot Purgatory: Why Most Agentic AI QA Projects Stall Before Production

Nearly nine in ten organisations are already running generative AI in quality engineering. Only 15% have it working at scale. The gap between those two numbers is where most UK software teams are stuck right now — not because the technology fails in the demo, but because nobody built the scaffolding to carry it into production. This is how you escape.

4 June 2026 · 8 min read

TL;DR

The Capgemini World Quality Report 2025–26 puts 89% of organisations piloting or deploying GenAI in quality engineering, yet only 15% at enterprise scale. The story of 2026 is the pilot-to-production gap, not adoption.
Pilots die at rollout for four structural reasons — governance, data, integration, and skills — none of which a better tool fixes.
The most common failure is the shadow pilot: an impressive demo run outside CI, against a low-risk app, with no baseline and no production-readiness criteria. It can never generalise, so it never ships.
A disciplined 90-day path — baseline first, integrate early, define 'ready' up front, name an owner — is what separates the 15% from everyone else.

The gap, by the numbers

The defining quality-engineering story of 2026 is not whether organisations are adopting agentic AI for testing. That argument is over. The Capgemini World Quality Report 2025–26, drawn from more than two thousand senior executives across twenty-two countries, finds 89% of organisations piloting or deploying generative-AI-augmented QE workflows. The real question is whether any of it reaches production — and here the picture is far less flattering.

The same report puts 37% of organisations in production and 52% still in pilot, but only 15% operating at enterprise scale. Read that again. For every team that has agentic QA running as a dependable part of how it ships software, there are five or six that have not reached that point, many of them holding a promising pilot with no route out of it.
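The "five or six" figure is simple arithmetic on the reported shares. A rough check, assuming the 15% is a share of all surveyed organisations rather than of adopters only (the variable names are ours, for illustration):

```python
# Share quoted from the World Quality Report 2025-26, in percent.
at_scale_pct = 15

# Assumption: the 15% is measured against all surveyed organisations,
# so the remaining 85% have not reached enterprise scale.
not_at_scale_pct = 100 - at_scale_pct

ratio = not_at_scale_pct / at_scale_pct
print(f"Organisations not at scale per organisation at scale: {ratio:.1f}")
```

Measured against adopters only (the 89%), the ratio would be somewhat lower, so treat the number as an order of magnitude rather than a precise count.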

We see the same shape on almost every engagement. The pilot worked. The demo landed. And then it sat there for two quarters, quietly, while everyone waited for someone to make it real.

Why pilots win the demo and lose the rollout

An agentic QA pilot is designed to succeed. You pick a contained application, give it a clean run, and show the room an agent driving a browser or an app with no script behind it. It is genuinely impressive, and it should be — the technology works.

Rollout is a different game entirely. Now the agent has to run inside a CI pipeline that was not built with it in mind, against applications with real authentication, real data, and real consequences when a release goes wrong. The questions change from "can it do this?" to "can we trust it every time, can we explain it to security, and who fixes it at 6pm on a Friday?" Those are not technology questions. They are operating questions, and the pilot was never set up to answer them.

This is the moment most pilots quietly die. Not in a meeting where someone cancels them, but in the absence of one — no owner, no criteria, no next step.

A pilot proves the technology can work once. Production proves the organisation can depend on it. Those are different proofs, and the second one is the hard one.

The four structural blockers

When we trace a stalled pilot back to its root cause, it almost always lands on one of four blockers. None of them is the model. None of them is fixed by switching tools.

Governance. Microsoft's UK Cloud data shows more than half of UK organisations still have no formal AI strategy. With no governance scaffolding to plug into, an agentic QA pilot has nowhere to land — every decision becomes a one-off argument, and the safest answer is always 'not yet'.
Data. Data privacy is the single largest barrier in the World Quality Report 2025–26, cited by 67% of executives. Test data carries personal information; the agent stack often involves third-party model providers. If nobody has worked out what flows where, the pilot cannot clear a security review, full stop.
Integration. Integration complexity is cited by 64%. A pilot run beside the pipeline is a science experiment; a pilot run inside it is a production system. Most pilots never make that crossing, because the crossing is the genuinely hard engineering and it was deferred.
Skills. Hallucination and reliability concerns are cited by 60%, and they are as much a skills gap as a technology limit. A team that has not learned to read, triage, and trust agentic failures will treat every red run as a reason to stop. Without that capability in-house, the pilot has no one to carry it.

Notice that three of these four are organisational, not technical. That is the uncomfortable finding: the thing blocking your pilot is usually not in the code.

The shadow-pilot trap

There is a specific anti-pattern that accounts for a disproportionate share of stuck pilots, and it is worth naming directly. We call it the shadow pilot.

A shadow pilot runs outside CI, on an engineer's machine or a side project, against a low-risk application chosen precisely because it is forgiving. It produces a great screen recording. It also captures no baseline: nobody measures the defect escape rate, the maintenance hours, or the pipeline duration before the pilot begins, so there is no honest way to prove it improved anything. Procurement, security, and data governance are not in the room, which means they become blockers later rather than collaborators now. And there is no definition of done, so when someone finally asks "is this ready for production?", the only available answer is a shrug.

A shadow pilot can run forever and never graduate, because everything that makes it easy to run is exactly what makes it impossible to scale. The way out is not a better demo. It is a pilot designed from the first day to become a production system.

If your pilot cannot answer 'compared to what?' with a number, it is a demo, not a pilot — and demos do not ship.

A 90-day escape framework

For UK mid-market teams, the route out of pilot purgatory fits comfortably inside a quarter. The structure matters more than the calendar, but ninety days is the right order of magnitude.

Days 0–30 are about measurement and scope. Before touching a tool, capture the baseline you will be judged against: current defect escape rate, test maintenance hours, flakiness, and pipeline duration. Pick one application that genuinely represents production — real auth, real data shapes, real release pressure — not the forgiving toy that makes the demo easy.
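A baseline can be as small as one dated record. The sketch below is one possible shape, not a prescribed schema: the class, its field names, and the two ratio definitions are our assumptions, and the numbers are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QaBaseline:
    """Day-zero snapshot of the four metrics the pilot will be judged against."""
    captured_on: date
    defects_found_before_release: int
    defects_escaped_to_production: int
    test_maintenance_hours_per_sprint: float
    flaky_runs: int
    total_runs: int
    pipeline_minutes: float

    @property
    def defect_escape_rate(self) -> float:
        # Assumed definition: escaped defects as a share of all defects found.
        total = self.defects_found_before_release + self.defects_escaped_to_production
        return self.defects_escaped_to_production / total if total else 0.0

    @property
    def flakiness(self) -> float:
        # Assumed definition: share of runs that needed a retry to pass.
        return self.flaky_runs / self.total_runs if self.total_runs else 0.0


# Illustrative numbers only, not data from any real team.
baseline = QaBaseline(
    captured_on=date(2026, 6, 4),
    defects_found_before_release=90,
    defects_escaped_to_production=10,
    test_maintenance_hours_per_sprint=24.0,
    flaky_runs=30,
    total_runs=500,
    pipeline_minutes=42.0,
)
print(f"escape rate {baseline.defect_escape_rate:.1%}, flakiness {baseline.flakiness:.1%}")
```

Freezing the record is deliberate: a baseline that can be edited after the pilot starts stops being a baseline.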

Days 30–60 are about integration and the awkward conversations. The agentic tests go into the real CI pipeline now, not beside it, because the integration is the hard part and deferring it is how pilots stall. This is also when security, data governance, and procurement come in — as collaborators while there is still time to design for their concerns, not as a gate that rejects the work at the end.

Days 60–90 are about the production-readiness decision. You measure against the day-zero baseline, run the readiness checklist below, and make an explicit call with the people who own the risk. Either it ships with a named owner and a documented basis, or you stop with a clear, defensible reason. Both of those are wins. The only failure is another quarter of drift.
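The day-90 call can be reduced to a comparison against the day-zero numbers plus an ownership check. This is a minimal sketch under stated assumptions: the metric names, the thresholds, and the `readiness_decision` function are hypothetical, and every value is invented for illustration.

```python
# All four metrics are "lower is better". Values are illustrative, not real data.
baseline = {"defect_escape_rate": 0.10, "maintenance_hours": 24.0,
            "flakiness": 0.06, "pipeline_minutes": 42.0}
pilot = {"defect_escape_rate": 0.07, "maintenance_hours": 15.0,
         "flakiness": 0.04, "pipeline_minutes": 45.0}

# Hypothetical criteria agreed before the pilot started: the maximum allowed
# value of pilot / baseline per metric (1.0 means "no worse than today").
max_ratio = {"defect_escape_rate": 0.8, "maintenance_hours": 0.8,
             "flakiness": 1.0, "pipeline_minutes": 1.1}


def readiness_decision(baseline, pilot, max_ratio, owner):
    """Return ('ship' or 'stop', reasons) from a baseline comparison."""
    reasons = []
    if not owner:
        reasons.append("no named owner")
    for metric, limit in max_ratio.items():
        # Assumes every baseline value is non-zero.
        ratio = pilot[metric] / baseline[metric]
        if ratio > limit:
            reasons.append(f"{metric}: {ratio:.2f}x baseline exceeds {limit:.2f}x")
    return ("stop" if reasons else "ship"), reasons


decision, reasons = readiness_decision(baseline, pilot, max_ratio, owner="QA lead")
print(decision, reasons)
```

The function returns its reasons alongside the verdict, so that a "stop" comes with the clear, defensible explanation the decision is supposed to carry.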

The companion discipline here is engineering the agent to behave predictably enough to trust — flake budgets, labelled failure classes, and the rest. We have written about that separately in our notes on making an agentic test run boring; this framework is what wraps around it organisationally.

Where this points

The enterprise direction of travel is platform-led. Deloitte has embedded UiPath Test Cloud into its Ascend delivery platform, and the other large consultancies are moving the same way. For a FTSE 100 buyer with an eight-figure transformation budget, that model makes sense.

Most UK mid-market software teams cannot and should not procure on that basis, and they do not need to. The capability that gets a pilot to production is mostly discipline — baselines, integration, criteria, ownership — applied by senior people who have done it before, working alongside your team rather than handing you a platform to maintain. That is the model we run, and it is deliberately the opposite of the platform play.

Agentic QA pilot readiness checklist

Seven checks that separate a pilot that can reach production from one that will drift. Run them before you decide whether to scale.

Capture a baseline before you touch a tool. Record defect escape rate, test maintenance hours, flakiness, and pipeline duration as they are today. Without a before, there is no provable after.
Pilot on a production-representative app. Choose an application with real authentication, real data shapes, and real release pressure, not the forgiving side project that makes the demo easy.
Wire the pilot into real CI from day one. If the agent runs beside the pipeline rather than inside it, you have deferred the hard part. Integration is the crossing most pilots never make.
Bring governance, security, and procurement in early. Invite them as collaborators while the design is still fluid. Excluded now, they become the blockers that reject the work at rollout.
Define production-readiness criteria up front. Agree what 'ready' means before you start, so the question at the end is a measurement, not a debate. Include a flake budget and a failure-triage process.
Name an owner for the capability. Decide who owns the agentic QA capability after the pilot ends. An orphaned capability stalls no matter how good the pilot was.
Set an explicit exit decision with a date. Put a date on the production-readiness call. On that date you either ship with a documented basis or stop with a clear reason. Drift is the only losing outcome.
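The seven checks are easier to enforce when they are tracked as data rather than as a slide. The following is a hedged sketch of one way to do that: the `Check` class, the `unmet` helper, and the evidence strings are all hypothetical, and the state shown is invented.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    done: bool
    evidence: str = ""  # note or link a reviewer can verify


def unmet(checks):
    """Names of checks that are not done or have no recorded evidence."""
    return [c.name for c in checks if not (c.done and c.evidence)]


# Illustrative state for a hypothetical pilot, not real data.
checks = [
    Check("Baseline captured", True, "baseline-2026-06.csv"),
    Check("Production-representative app", True, "billing service"),
    Check("Wired into real CI", True, "pipeline stage 'agentic-e2e'"),
    Check("Governance, security, procurement involved", True, "review notes"),
    Check("Readiness criteria defined", True, "criteria doc v1"),
    Check("Owner named", False),
    Check("Exit decision dated", True),  # marked done, but no evidence recorded
]

gaps = unmet(checks)
print("ready to scale" if not gaps else f"not ready: {gaps}")
```

Requiring evidence as well as a tick is a design choice: it keeps a check from passing on someone's say-so alone.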

The thing blocking your agentic QA pilot is almost never the model. It is the absence of a baseline, an integration, a definition of done, and an owner — and no vendor demo will ever supply those for you.

Key takeaways

Adoption is settled; scale is the problem. 89% are piloting or deploying GenAI in QE, but only 15% operate at enterprise scale (World Quality Report 2025–26).
Pilots stall on four structural blockers — governance, data, integration, skills — and three of the four are organisational, not technical.
The shadow pilot — outside CI, on a forgiving app, with no baseline or criteria — is engineered to impress and structurally unable to scale.
Capture a baseline before you start, or you can never prove the pilot worked. 'Compared to what?' must have a number.
A 90-day path that measures first, integrates early, defines 'ready' up front, and names an owner is what separates the 15% from the rest.

FAQs

What actually counts as 'enterprise scale' for agentic QA?

In the World Quality Report's framing it means agentic QE is a dependable, standard part of how the organisation ships — running inside CI across multiple teams and applications, governed, owned, and trusted — rather than a pilot confined to one squad or one app. The 15% figure is organisations that have reached that state, not those merely using the tools.

We already have a working pilot. Why isn't that enough?

A working pilot proves the technology can succeed once, on a contained problem. Production proves your organisation can depend on it every time, explain it to security, and maintain it. Those are different proofs. Most stalled pilots are technically fine and organisationally unready — no baseline, no CI integration, no production-readiness criteria, no owner.

Do we need to buy a platform like the large consultancies use?

Almost certainly not, if you are a mid-market team or scale-up. Platform-led models such as Deloitte Ascend with UiPath Test Cloud are built for global enterprises with the budget and procurement weight to absorb them. The capability that gets a pilot to production is mostly discipline applied by senior people, which does not require a platform purchase.

How long does it realistically take to get from pilot to production?

For a UK mid-market team with one production-representative application, a disciplined effort fits inside roughly ninety days: a month to baseline and scope, a month to integrate into CI and bring governance in, a month to measure against the baseline and make an explicit readiness decision. The calendar flexes; the order does not.

What is the single most common reason pilots fail to scale?

No baseline. If you did not measure defect escape rate, maintenance hours, flakiness, and pipeline duration before the pilot, you cannot prove it improved anything — so it cannot survive a procurement or governance review. Capturing the baseline on day zero is the cheapest, highest-impact thing most teams skip.

Stuck in pilot purgatory?

We run a 30-minute readiness review that locates exactly where your agentic QA pilot is stuck — governance, data, integration, or skills — and gives you the next concrete step toward production. No platform to procure, no slide deck.


About the author: Venkata Kari · Founder, GVK Technologies

Twenty years in QA leadership, much of it spent watching promising pilots stall for reasons that had nothing to do with the technology. GVK Technologies works alongside in-house teams to take agentic QA from proof-of-concept to a production capability the client owns — not a platform they rent.
