Field Notes · Agentic QA Strategy

Hallucination, Flakiness, and Trust: How to Evaluate an Agentic AI Test Agent in 2026

Traditional test tools are deterministic — the same script runs the same way every time. Agentic tools plan, decide, and adapt, which means they can also hallucinate, flake, and behave inconsistently. Most procurement evaluations fail because they apply traditional buying criteria to a fundamentally different kind of system. Here is a vendor-neutral way to evaluate one with your eyes open.

10 June 2026 · 9 min read

TL;DR

Evaluating an agentic test agent is not like evaluating a traditional tool. Determinism is gone, so the old buying rubric does not transfer.
The three failure modes that matter are hallucination in test generation, non-determinism in execution, and black-box decisions you cannot audit.
The only valid evaluation runs the candidate against your application, your data, and your CI — never the vendor's demo app.
Use the twelve-point checklist below to stress-test for the failure modes, not just the happy path. If a tool cannot be meta-tested, that is your answer.

Why this is a different purchase

A traditional automation tool is deterministic. You write a script, it executes the same steps in the same order every run, and when it fails you know the assertion was false or the selector did not resolve. Your evaluation can be a feature checklist because the behaviour is fixed.

An agentic test agent does not work this way. It is given an intent and it plans, decides, and adapts its way to a result. That adaptability is the value — it is also the risk. The same agent can take a different path on two identical runs, generate an assertion that looks plausible and is wrong, or make a decision it cannot explain. None of those failure modes exist in the deterministic world your procurement process was built for.

Deloitte frames the agent-orchestrated development life cycle around a human supervising the agent as a collaborator. Most procurement has not caught up: it still buys agents as tools. The evaluation has to change to match what you are actually buying.

The hallucination problem

When an agent generates a test, it can produce an assertion that reads sensibly and asserts the wrong thing. It checks that a confirmation message appears, but against a flow your application does not have. It validates a total that it reasoned its way to rather than read from the screen. The test passes, looks like coverage, and protects nothing.

This is the agentic version of the false-positive problem, and it is more dangerous than a flaky failure because it is silent. A red test gets attention. A green test that asserts nothing real gets trusted. The Capgemini World Quality Report 2025–26 puts hallucination and reliability among the top three barriers to adopting GenAI in quality engineering, cited by 60% of executives — and assertion design is where it bites hardest.

A flaky test that fails is annoying. A hallucinated test that passes is dangerous — it looks like coverage and protects nothing. Evaluate for the second, not just the first.
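One way to put a number on this is to have a person review a sample of generated assertions against the real application and label each one. The sketch below is a minimal illustration under that assumption; the `Verdict` labels, the `ReviewedAssertion` record, and the sample data are all hypothetical and not part of any vendor's tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    GROUNDED = "grounded"          # checks something the application really does
    HALLUCINATED = "hallucinated"  # reads sensibly but checks the wrong thing


@dataclass(frozen=True)
class ReviewedAssertion:
    test_id: str
    assertion: str
    verdict: Verdict    # assigned by a human reviewer, not by the agent
    test_passed: bool   # what the agent's own run reported


def hallucination_rate(reviews):
    """Share of reviewed assertions that a human judged to be hallucinated."""
    if not reviews:
        raise ValueError("no reviewed assertions to score")
    bad = sum(1 for r in reviews if r.verdict is Verdict.HALLUCINATED)
    return bad / len(reviews)


def silent_hallucinations(reviews):
    """Hallucinated assertions whose test still passed: green, and protecting nothing."""
    return [r for r in reviews if r.verdict is Verdict.HALLUCINATED and r.test_passed]


# Hypothetical review of four generated assertions.
reviews = [
    ReviewedAssertion("checkout-01", "order total equals sum of line items", Verdict.GROUNDED, True),
    ReviewedAssertion("checkout-02", "confirmation banner shown after payment", Verdict.HALLUCINATED, True),
    ReviewedAssertion("checkout-03", "cart badge shows item count", Verdict.GROUNDED, True),
    ReviewedAssertion("checkout-04", "discount applied to total", Verdict.HALLUCINATED, False),
]

rate = hallucination_rate(reviews)
silent_ids = [r.test_id for r in silent_hallucinations(reviews)]
print(f"hallucination rate: {rate:.0%}")
print(f"passed anyway: {silent_ids}")
```

The second number is the one to watch. A hallucinated assertion that fails at least draws attention; the ones that pass are the silent risk described above.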

Flakiness versus determinism in execution

Beyond what it generates, an agent has to execute, and execution is where non-determinism shows up as flake. The same test, on the same build and data, run twice can take two different paths, and sometimes one run passes while the other fails. In a deterministic suite that is a defect. With an agent the cause can be the model's sampling temperature, an ambiguous screen, or a plan that branched differently for no good reason.

A short vendor trial on a stable demo will not surface this, because the demo app is forgiving and the runs are few. You only see the real flake rate when you run the agent many times against an application that behaves like production. An evaluation that does not measure reproducibility across repeated runs has not measured the thing that will decide whether your team trusts the tool.
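As a rough sketch of what measuring reproducibility could look like, the snippet below summarises repeated runs of one case by how often the path and the verdict depart from the most common outcome. The `RunResult` shape and the action names are assumptions for illustration; a real agent would expose its trace in its own format.

```python
from collections import Counter
from typing import NamedTuple, Tuple


class RunResult(NamedTuple):
    path: Tuple[str, ...]  # ordered actions the agent took on this run
    passed: bool           # the verdict the agent reported


def reproducibility_report(runs):
    """Summarise how much repeated runs of one case disagree with each other."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    path_counts = Counter(r.path for r in runs)
    verdict_counts = Counter(r.passed for r in runs)
    # Runs that match the single most common path or verdict.
    modal_path_runs = path_counts.most_common(1)[0][1]
    modal_verdict_runs = verdict_counts.most_common(1)[0][1]
    return {
        "runs": n,
        "distinct_paths": len(path_counts),
        "path_divergence_rate": (n - modal_path_runs) / n,
        "verdict_divergence_rate": (n - modal_verdict_runs) / n,
    }


# Hypothetical data: ten runs of the same case on the same build and data.
direct = ("open_cart", "click_checkout", "fill_card", "submit")
detour = ("open_cart", "open_menu", "click_checkout", "fill_card", "submit")
runs = (
    [RunResult(direct, True)] * 7
    + [RunResult(detour, True)] * 2
    + [RunResult(detour, False)]
)

report = reproducibility_report(runs)
print(report)
```

Tracking path and verdict separately is a deliberate choice in this sketch. A path that varies while the verdict holds may be tolerable; a verdict that flips is the flake your team will actually feel.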

Meta-testing: the part everyone skips

The single most overlooked step in evaluating an agentic test agent is testing the tester. You would never adopt a measuring instrument without checking it against a known standard, yet teams routinely adopt an agent on the strength of a demo and never ask how often it is wrong.

Meta-testing means feeding the agent cases where you already know the answer — known defects it should catch, and known-good flows it should pass — and counting how often it gets them right. It means running the same case repeatedly to measure reproducibility, and inspecting whether the agent can explain a decision after the fact. This is the same discipline we describe in our notes on evals being the test suite for your test suite: an agent you cannot meta-test is an agent you cannot trust, however good the demo looked.
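The counting itself can be very simple. Here is a minimal sketch, assuming you have a set of cases with a known ground truth and a record of whether the agent flagged each one; the `KnownCase` record and the case names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownCase:
    name: str
    has_defect: bool     # ground truth you already know
    agent_flagged: bool  # whether the agent reported a problem


def meta_test_summary(cases):
    """Score the agent against cases where the right answer is already known."""
    defects = [c for c in cases if c.has_defect]
    goods = [c for c in cases if not c.has_defect]
    if not defects or not goods:
        raise ValueError("need both known-defect and known-good cases")
    caught = sum(1 for c in defects if c.agent_flagged)
    false_alarms = sum(1 for c in goods if c.agent_flagged)
    return {
        "detection_rate": caught / len(defects),         # known defects it caught
        "false_positive_rate": false_alarms / len(goods),  # good flows it flagged
        "missed": [c.name for c in defects if not c.agent_flagged],
    }


# Hypothetical calibration set: four known defects, four known-good flows.
cases = [
    KnownCase("rounding-bug-in-total", True, True),
    KnownCase("expired-coupon-accepted", True, True),
    KnownCase("duplicate-order-on-retry", True, False),
    KnownCase("login-lockout-bypass", True, True),
    KnownCase("standard-checkout", False, False),
    KnownCase("guest-checkout", False, True),
    KnownCase("password-reset", False, False),
    KnownCase("profile-update", False, False),
]

summary = meta_test_summary(cases)
print(summary)
```

The list of missed cases matters as much as the rates. It tells you which kind of defect the agent is blind to, which is harder to see in a single percentage.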

Red flags in the demo

A vendor demo is built to show the agent succeeding. Your job is to find where it fails, and a handful of signals tell you more than any polished walkthrough.

The agent is only ever shown on the vendor's own stable demo app. Ask to point it at a messy slice of your product, and watch how readily they agree.
Decisions happen with no visible review step. If the pitch is that it 'just works', ask where a wrong decision would surface. Silence is the answer.
No reproducibility story. If the vendor cannot quote a flake rate across repeated runs, they have not measured it — which means you will, after you have bought it.
ROI is quoted as a single industry percentage rather than demonstrated by meta-testing against your own known cases. A serious vendor expects you to test the tester.

Green flags worth paying for

Most of this post is about what to be wary of, so it is worth naming the signals that justify a premium. These are the things a serious agentic test vendor will offer and a weak one will deflect.

Reproducibility controls — the ability to pin temperature, seed decisions, and get the same first action on the same input. A vendor who understands determinism has built for it.
Explainable decisions — for every step, the agent can show you what it saw and the plan it formed, so a failure can be triaged as perception or reasoning rather than shrugged off.
Honest failure surfacing — non-trivial decisions are logged and reviewable, not silently executed. The tool tells you when it was unsure.
A meta-testing story — the vendor expects you to test the tester against your own known cases and helps you do it, rather than steering you back to the demo.
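To make the explainability and failure-surfacing signals concrete, it can help to picture what a reviewable decision record might contain. The sketch below is purely illustrative: every field name, the self-reported `confidence` value, and the review threshold are assumptions, not a description of how any particular tool logs its decisions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DecisionRecord:
    step: int
    observed: str          # what the agent says it saw on screen
    plan: str              # the plan it formed from that observation
    action: str            # what it actually did
    confidence: float      # hypothetical self-reported certainty, 0.0 to 1.0
    non_trivial: bool      # True when the agent deviated from the obvious path
    seed: Optional[int] = None  # pinned seed, if the tool supports one


def needs_human_review(record, min_confidence=0.8):
    """Surface a decision when it was non-trivial or the agent was unsure."""
    return record.non_trivial or record.confidence < min_confidence


def to_log_line(record):
    """Serialise one decision as a JSON line for an audit trail."""
    return json.dumps(asdict(record), sort_keys=True)


record = DecisionRecord(
    step=4,
    observed="Two buttons labelled 'Continue'; one inside a promo banner",
    plan="Use the button inside the checkout form, ignore the banner",
    action="click form button 'Continue'",
    confidence=0.62,
    non_trivial=True,
    seed=1234,
)

line = to_log_line(record)
print(line)
print("review needed:", needs_human_review(record))
```

Keeping the observation and the plan as separate fields is what allows a failure to be triaged as perception (it saw the wrong thing) or reasoning (it saw the right thing and planned badly).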

The twelve-point agentic test agent evaluation checklist

A vendor-neutral checklist for evaluating an agentic AI test agent. Run every point against your own application, data, and CI — not the vendor's demo.

1. Run it on your application, not the demo: An evaluation run only on the vendor's stable demo app is invalid. Insist on running against a representative slice of your own product.
2. Measure the hallucination rate: Feed it flows you know, and count how often it generates assertions that look right but check the wrong thing.
3. Measure reproducibility across repeated runs: Run the same case many times on the same build and data. Record how often the path and the verdict differ.
4. Test detection of known defects: Reintroduce bugs you have already fixed. A tool that misses known defects will miss new ones.
5. Test for false positives on known-good flows: Run flows you know are correct. Count how often the agent flags a problem that is not there.
6. Check explainability of every decision: For a failed step, can the agent show what it saw and the plan it formed? If not, triage is impossible.
7. Audit the decision trail: Confirm there is a reviewable record of non-trivial decisions, not silent execution you cannot reconstruct.
8. Probe data and model dependencies: Ask what training data and which external model providers the agent depends on, and what changes when they change.
9. Test environmental sensitivity: Vary timing, viewport, and load. A tool whose verdict changes with the weather will erode trust in CI.
10. Confirm CI integration is real: Run it inside your pipeline, not beside it. Measure feedback time and how failures present to the team.
11. Establish the human review loop: Decide where a non-trivial agent decision surfaces for a person. A tool with no review step is a liability, not a feature.
12. Get a procurement-ready residual-risk statement: Document the risk that remains with the chosen tool, so security and governance can sign off on a known position, not a hope.
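The environmental sensitivity check lends itself to a small harness: run one case across a grid of conditions and list the conditions where the verdict departs from the majority. This is a sketch under stated assumptions. The condition values are arbitrary, and `run_case` is a stand-in with simulated behaviour that you would replace with a call into the tool under evaluation.

```python
from itertools import product

# Hypothetical condition grid; choose values that reflect your own users and CI.
VIEWPORTS = [(1920, 1080), (1366, 768), (390, 844)]
LATENCIES_MS = [0, 300, 1500]
LOADS = ["idle", "busy"]


def run_case(viewport, latency_ms, load):
    """Stand-in for one agent run. Returns True when the agent reports a pass.

    Simulated behaviour: the verdict flips on a narrow viewport with high latency.
    """
    width, _height = viewport
    return not (width < 500 and latency_ms >= 1500)


def sensitivity_matrix(run):
    """Run the same case once under every combination of conditions."""
    return {combo: run(*combo) for combo in product(VIEWPORTS, LATENCIES_MS, LOADS)}


def unstable_conditions(results):
    """Conditions whose verdict differs from the majority verdict."""
    verdicts = list(results.values())
    majority = verdicts.count(True) >= verdicts.count(False)
    return [combo for combo, verdict in results.items() if verdict != majority]


results = sensitivity_matrix(run_case)
unstable = unstable_conditions(results)
print(f"{len(unstable)} of {len(results)} conditions changed the verdict")
for combo in unstable:
    print("  ", combo)
```

In a real evaluation each cell would itself be repeated several times, so that a verdict change caused by the environment can be told apart from ordinary run-to-run variation.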

You would not adopt a measuring instrument without checking it against a known standard. An agentic test agent is a measuring instrument. Test the tester, or you are trusting a number you never verified.

Key takeaways

Agentic test agents are non-deterministic, so the deterministic-tool buying rubric does not transfer. Evaluate the failure modes, not a feature list.
Hallucinated assertions are the dangerous failure — they pass, look like coverage, and protect nothing. Measure the rate explicitly.
Reproducibility only shows up across many runs on a production-like app. A short demo trial cannot surface the real flake rate.
Meta-testing — feeding the agent known answers and counting how often it is right — is the step most evaluations skip and the one that matters most.
Run all twelve checks against your own application, data, and CI. A tool that cannot be meta-tested is a tool you cannot trust.

FAQs

Why can't we just use our existing tool-evaluation rubric?

Because your existing rubric assumes determinism: that a tool behaves the same way every time. Agentic test agents plan and adapt, so they can hallucinate, flake, and make black-box decisions. Those failure modes are not on a traditional rubric, which is why traditional rubrics pass tools that later fail in production.

What is meta-testing, in practice?

Testing the tester. You feed the agent cases where you already know the answer — known defects it should catch and known-good flows it should pass — and count how often it is right. You also run the same case repeatedly to measure reproducibility, and check whether it can explain its decisions. It is the agentic equivalent of calibrating an instrument before you trust its readings.

How many runs do we need to measure reproducibility?

Enough that a rare divergence has a chance to appear — many more than a vendor trial typically includes. A handful of runs on a stable demo tells you nothing. Repeated runs against a production-like application are where non-determinism becomes visible and measurable.
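For a rough sense of scale, suppose each run diverges independently with probability p. The chance of seeing at least one divergence in n runs is then 1 - (1 - p)^n, which can be solved for n. The sketch below does that arithmetic; the independence assumption is a simplification, and the function names are ours.

```python
import math


def runs_needed(divergence_rate, confidence=0.95):
    """Runs required to see at least one divergence with the given confidence.

    Assumes each run diverges independently with probability divergence_rate.
    """
    if not 0 < divergence_rate < 1:
        raise ValueError("divergence_rate must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - divergence_rate))


def upper_bound_after_clean_runs(clean_runs, confidence=0.95):
    """Largest divergence rate still plausible after clean_runs with no divergence."""
    if clean_runs <= 0:
        raise ValueError("clean_runs must be positive")
    return 1 - (1 - confidence) ** (1 / clean_runs)


for rate in (0.10, 0.05, 0.01):
    print(f"divergence rate {rate:.0%}: about {runs_needed(rate)} runs")

print(f"100 clean runs still allow a rate up to {upper_bound_after_clean_runs(100):.1%}")
```

Under these assumptions, a 5% divergence rate needs roughly 59 runs to show up with 95% confidence, and a 1% rate needs roughly 299. A handful of demo runs is nowhere near either figure.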

Is a non-deterministic test tool ever acceptable?

Yes, if the non-determinism is bounded, measured, and reviewed. The goal is not zero variation; it is a flake rate inside a budget you have chosen, with a human review loop for non-trivial decisions. An agent that varies unpredictably and cannot explain itself is the one to avoid.
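A budget only means something if the measured rate is compared against it with the sample size in mind. One possible approach, sketched below, is to compare the upper end of a confidence interval on the flake rate with the budget rather than the raw observed rate. The Wilson score interval used here is our choice for illustration, not something the checklist prescribes, and the budget value is hypothetical.

```python
import math


def wilson_upper_bound(flaky_runs, total_runs, z=1.96):
    """Upper end of the Wilson score interval for the true flake rate.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    if total_runs <= 0 or not 0 <= flaky_runs <= total_runs:
        raise ValueError("need 0 <= flaky_runs <= total_runs and total_runs > 0")
    p = flaky_runs / total_runs
    centre = p + z * z / (2 * total_runs)
    margin = z * math.sqrt(p * (1 - p) / total_runs + z * z / (4 * total_runs ** 2))
    return (centre + margin) / (1 + z * z / total_runs)


def within_flake_budget(flaky_runs, total_runs, budget):
    """True only when even the pessimistic estimate fits inside the budget."""
    return wilson_upper_bound(flaky_runs, total_runs) <= budget


BUDGET = 0.05  # hypothetical: at most 5% of runs may flake

for flaky, total in ((0, 20), (2, 200)):
    bound = wilson_upper_bound(flaky, total)
    verdict = "inside" if within_flake_budget(flaky, total, BUDGET) else "not shown to be inside"
    print(f"{flaky}/{total} flaky runs: upper bound {bound:.1%}, {verdict} budget")
```

Note the first case: twenty runs with no flake at all still fail this gate, because twenty runs cannot rule out a rate well above 5%. Two flaky runs out of two hundred pass it.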

What if none of the tools pass the checklist?

That is a valid and valuable outcome. A clear 'not yet' saves the spend on a tool that would not survive contact with your production application, and tells you what has to change — in the tools or in your suite — before agentic testing is the right move. We would rather hand you that answer than a tool you will regret.

Evaluating an agentic test tool?

We run structured, vendor-agnostic evaluations against your own application, data, and CI, meta-testing each candidate for hallucination, reproducibility, and explainability, and handing you a procurement-ready recommendation that security and engineering can both sign off on. That includes, when it is true, the answer that none of them fit yet.


About the author: Venkata Kari · Founder, GVK Technologies

Twenty years in QA leadership, lately spent helping teams buy agentic test tooling without buying the demo. GVK Technologies runs vendor-agnostic evaluations against the client's own application, data, and CI — and will tell you when no tool on the shortlist is the right fit.
