Engineering Notes · Agentic QA

Pointing an Agent at XCUITest: The Seven Things That Decide Signal From Noise

An agent driving a web app reads the DOM. An agent driving an iOS app reads the accessibility tree through a simulator, a proxy process, and a framework that relaunches the app on every test. The agentic pattern transfers, but the failure modes are different — and most of them are decided before the agent makes its first tap.


TL;DR

  • iOS gives the agent less to work with than the browser — no DOM, just the accessibility tree XCUITest exposes. That makes accessibility identifiers load-bearing, not optional.
  • State is the first thing that bites. XCUITest relaunches the app each test, but the simulator keeps UserDefaults, Keychain, and the on-disk store. If you don't reset it deterministically, the agent inherits yesterday's session.
  • Mock the HTTP layer so the agent exercises screens like a user while the backend stays deterministic. A flaky API turns every perception and reasoning error into noise you can't separate.
  • Treat the agent's screen map (agent.yml) as versioned code. Watch the run for anomalies, let the agent propose the version bump, and keep the .xcresult bundle — it's the richest triage artifact Xcode gives you, and most teams throw it away.

Why iOS is a harder target than the browser

Most teams meet agentic QA on the web first. The agent gets a screenshot and a DOM, it reasons about what to click, and a tool layer translates that into a Playwright or Selenium call. The DOM matters here: even when the agent works visually, there is a complete, queryable description of the page underneath it as a fallback and a ground truth.

iOS takes that away. There is no DOM. XCUITest sees the app through the accessibility tree — the same tree VoiceOver uses — surfaced across a process boundary by a proxy that talks to your app under test. Elements resolve by accessibility identifier, label, and type, not by CSS selector. The app runs in a simulator (or on a device) that carries state between launches. And the framework relaunches your app, from a cold process, on essentially every test.

None of this makes agentic QA on iOS impossible. We run it. But it moves where the work is. On the web, most of the engineering is in the harness around the agent. On iOS, a surprising amount of it is in the app and the environment — decided before the agent ever makes its first tap. There are seven of these, and they are the difference between an agent that produces signal and one that produces a wall of non-reproducible red.

1. State between tests is the first thing that bites

Here is the trap. A typical XCUITest suite calls XCUIApplication.launch() at the start of each test, which terminates any running instance and starts a fresh process, so engineers assume each test starts clean. It doesn't. The process is new; the simulator isn't. UserDefaults, the Keychain, Core Data or SQLite stores, the URL cache, and downloaded files all survive the relaunch. The first test logs in and writes a token to the Keychain. The second test launches, finds the token, skips the login screen it was written to exercise, and fails on a screen it never expected to see.

For an agent this is worse than for a scripted test, because the agent doesn't know what state it inherited. It sees a home screen where it expected a login screen, reasons that it must already be authenticated, and confidently does the wrong thing. You log that as a reasoning error. It isn't. It's a state-leak error wearing a reasoning error's clothes, and you'll chase it in the wrong place for a day.

Reset deterministically, and do it from outside the test logic. The levers, in order of how much we reach for them:

  • Launch arguments and environment. Pass a flag through XCUIApplication.launchArguments (e.g. -uitests-reset) and, on launch, have the app wipe UserDefaults for its suite, clear the Keychain, and delete its store before the first frame renders. This is the fastest reset and the one you'll use per-test.
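  • A clean simulator. Running xcrun simctl erase on the target device between runs gives you a guaranteed-clean device. The device has to be shut down before it can be erased, and the whole cycle is slow, so it belongs at the start of a run or a shard, not between every test.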
  • An in-app reset endpoint, gated to test builds. A single entry point that returns the app to a known seed — logged out, empty store, default flags — invoked by a launch flag. This is the one the agent can call by name, which matters when you want the agent to reset and retry inside a budget rather than carry corruption forward.

If you take one thing from this section: a fresh process is not a fresh app. The simulator remembers. Reset on purpose, or the agent inherits a session it can't see and you mislabel the failure.
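As a rough harness-side sketch of how the two cheapest levers might be combined: an expensive simulator erase once per shard, and the launch flag on every test. The `ResetPlan` type and `plan_reset` function are hypothetical names; the `-uitests-reset` flag comes from the example above and only does anything if the app implements the wipe. The simctl subcommands are real, but a runner should tolerate `shutdown` failing on a device that is already shut down.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Flag name taken from the text; the app has to implement the actual wipe.
RESET_FLAG = "-uitests-reset"


@dataclass
class ResetPlan:
    """What the harness does before a test, split by cost (hypothetical type)."""
    shell_commands: list[list[str]] = field(default_factory=list)
    launch_arguments: list[str] = field(default_factory=list)


def plan_reset(simulator_udid: str, first_test_in_shard: bool) -> ResetPlan:
    plan = ResetPlan()
    if first_test_in_shard:
        # simctl will not erase a booted device, so shut it down first,
        # erase it, then boot it again for the shard.
        plan.shell_commands = [
            ["xcrun", "simctl", "shutdown", simulator_udid],
            ["xcrun", "simctl", "erase", simulator_udid],
            ["xcrun", "simctl", "boot", simulator_udid],
        ]
    # Cheap per-test reset: the UI test forwards this through
    # XCUIApplication.launchArguments before calling launch().
    plan.launch_arguments = [RESET_FLAG]
    return plan
```

Keeping the plan as data rather than executing it inline makes the reset visible in the run log, which helps when a failure later needs to be checked against the state it started from.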

2. Mock the HTTP interface so the agent drives the app, not the backend

You want the agent to exercise the app the way a user does — tap through the real screens, real navigation, real view models — without the real backend underneath. A live backend introduces latency, rate limits, auth expiry, and data that changes between runs. Every one of those turns into a test failure that has nothing to do with the app, and on an agentic suite it's poison: you can no longer tell a perception or reasoning error from a backend hiccup, because they all look like red.

The clean seam on iOS is the HTTP layer. Register a URLProtocol subclass at launch (again, gated by a launch flag so it only loads under test) and serve canned responses from fixtures. URLProtocol.registerClass covers URLSession.shared; a session built from its own URLSessionConfiguration needs the subclass added to that configuration's protocolClasses instead. The app makes its real network calls, the view models parse real-shaped payloads, the screens render exactly as they would in production. Only the bytes on the wire are deterministic.

The discipline that makes this pay off is keying fixtures to scenarios rather than to endpoints. "Empty inbox", "inbox with one unread", "payment declined", "server returns 500" — each is a named bundle of responses the agent selects via launch environment. Now the agent isn't just clicking around; it's driving a specific, reproducible journey, and when it fails you know precisely which backend state it was standing on.

Mock at the HTTP boundary, not in the view models. The point is to keep every screen, transition, and parse path real — only the network is canned. Stub higher up and you stop testing the thing you shipped.
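The scenario-keyed lookup can be modelled in a few lines. This is a language-neutral sketch written in Python, not the in-app stub itself, which would live in the URLProtocol subclass. The scenario names echo the examples above; the paths, payloads, the `MOCK_SCENARIO` variable, and every function name are assumptions made for illustration. One design choice is baked in: a request the scenario does not cover raises instead of falling through to a live network.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CannedResponse:
    status: int
    body: str


class UnmockedRequest(LookupError):
    """Raised when a scenario has no fixture for a request."""


# Hypothetical bundles: each scenario is a named set of responses.
SCENARIOS: dict[str, dict[tuple[str, str], CannedResponse]] = {
    "empty_inbox": {
        ("GET", "/inbox"): CannedResponse(200, '{"messages": []}'),
    },
    "server_returns_500": {
        ("GET", "/inbox"): CannedResponse(500, '{"error": "internal"}'),
    },
}


def resolve(scenario: str, method: str, path: str) -> CannedResponse:
    """Look up the canned response for one request within one scenario."""
    if scenario not in SCENARIOS:
        raise UnmockedRequest(f"unknown scenario {scenario!r}")
    try:
        return SCENARIOS[scenario][(method.upper(), path)]
    except KeyError:
        raise UnmockedRequest(f"{scenario}: no fixture for {method} {path}") from None


def launch_environment(scenario: str) -> dict[str, str]:
    """Environment the harness would hand to XCUIApplication.launchEnvironment."""
    if scenario not in SCENARIOS:
        raise UnmockedRequest(f"unknown scenario {scenario!r}")
    return {"MOCK_SCENARIO": scenario}  # hypothetical variable name
```

Failing loudly on an unmocked request is what keeps the run attributable: a missing fixture shows up as a harness error with a name, not as a timeout the agent has to interpret.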

3. Accessibility identifiers are the contract — keep them honest

On the web the agent can fall back to the DOM. On iOS the accessibility tree is all it has, and accessibility identifiers are how XCUITest names things in that tree. They are the contract between the app and the test. When they're missing, the agent is reduced to matching on visible labels — which are localised, which change for copy reasons, and which collide (three buttons that all say "Done"). When they're duplicated or unstable, element resolution becomes ambiguous and the agent's taps land on the wrong control.

This is the most common root cause we find behind "the agent is flaky on iOS". It usually isn't the agent. It's that nobody owns the identifiers, so they rot — a refactor renames a view, a new screen ships with none, a designer's relabel quietly changes what the agent was matching on.

Two things keep them honest. First, identifiers are set deliberately in the app — accessibilityIdentifier on UIKit views, the .accessibilityIdentifier() modifier in SwiftUI — and reviewed like any other interface, because they are one. Second, the agent audits them. Before a run, it walks the accessibility tree of each screen it knows about and flags identifiers that are missing, duplicated, or no longer resolve. That audit is cheap and it catches the rot before it becomes a 2am misclick that you waste a morning calling a reasoning error.
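A minimal sketch of that pre-run audit, assuming the harness can dump each screen's accessibility tree as a flat list of records with `type`, `identifier`, and `label` keys. That record shape, the set of interactive types, and the example identifiers are all assumptions, not an XCUITest format.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field

# Assumed element type names for controls the agent may need to tap.
INTERACTIVE_TYPES = {"button", "textField", "secureTextField", "switch", "cell"}


@dataclass
class AuditReport:
    missing: list[str] = field(default_factory=list)       # expected, not in the tree
    duplicated: list[str] = field(default_factory=list)    # present more than once
    unidentified: list[str] = field(default_factory=list)  # interactive, no identifier

    @property
    def clean(self) -> bool:
        return not (self.missing or self.duplicated or self.unidentified)


def audit_screen(expected_ids: set[str], elements: list[dict]) -> AuditReport:
    """Compare the identifiers a screen should expose with what the tree holds."""
    counts = Counter(e["identifier"] for e in elements if e.get("identifier"))
    return AuditReport(
        missing=sorted(expected_ids - set(counts)),
        duplicated=sorted(i for i, n in counts.items() if n > 1),
        # Report unidentified controls by visible label so a human can find them.
        unidentified=sorted(
            e.get("label", "")
            for e in elements
            if e.get("type") in INTERACTIVE_TYPES and not e.get("identifier")
        ),
    )
```

Repeated identifiers are sometimes intentional, for example on reused list cells, so in practice the duplicate check probably wants an allow-list per screen.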

A missing accessibility identifier is not just a test problem. It's the same gap a VoiceOver user hits. When the agent can't find a control, that's an accessibility signal worth filing against the app, not flake to retry away.

4. Screen actions are code, and the agent maintains them

Above the raw identifiers sits a layer of screen actions — the iOS equivalent of page objects. "Log in as a returning user", "open the third item in the list", "pull to refresh and wait for the spinner to clear". These encapsulate the sequence of taps, waits, and assertions that make up a meaningful unit of behaviour, so the agent reasons in user intent instead of re-deriving a tap sequence from pixels every single run.

The reason to write them down rather than let the agent improvise each time is determinism. An agent that re-plans how to log in on every run will eventually plan it wrong; an agent that calls a named, stable login action does the same thing every time, and the variation is saved for where it earns its keep — exploring the screen under test. This is the same instinct as seeding the tool-call order on the web: pin the parts that shouldn't vary.

But screens change, and that's where the agent earns a second job. When the UI shifts — a field moves, a step is added to a flow, a button is renamed — the old screen action breaks. The agent detects the break, walks the new screen, and proposes an updated action: the new tap sequence, the new identifiers, the new waits. Crucially it proposes; it doesn't silently rewrite. A human reviews the diff and a new golden trace is recorded before the change ships. The agent does the tedious part — re-deriving the path through a changed screen — and a person keeps the judgement.
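One way the "propose, don't rewrite" step could look in the harness: render the recorded action and the candidate as text and hand the reviewer a unified diff. The step representation (verb and target pairs) and the function names are hypothetical; `difflib.unified_diff` is from the Python standard library.

```python
from __future__ import annotations

import difflib

Step = tuple[str, str]  # (verb, target identifier), an assumed representation


def render_action(name: str, steps: list[Step]) -> list[str]:
    """One line per step, so each change shows up as its own +/- line."""
    return [f"action: {name}"] + [f"  - {verb} {target}" for verb, target in steps]


def propose_update(name: str, recorded: list[Step], candidate: list[Step]) -> str:
    """Return a reviewable diff; the recorded action itself is never modified."""
    diff = difflib.unified_diff(
        render_action(name, recorded),
        render_action(name, candidate),
        fromfile=f"{name} (recorded)",
        tofile=f"{name} (proposed)",
        lineterm="",
    )
    return "\n".join(diff)  # empty string when nothing changed
```

For example, if a flow gains a "continue" step between entering the email and submitting, the proposal contains a single added line for that tap, and the reviewer approves or rejects it like any other change.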

5. Watch the run to catch anomalies, not just failures

A pass/fail result tells you whether the assertion held. It tells you nothing about how the run got there. On iOS especially, the interesting problems live in the gap between green and clean — a run can pass while doing something it shouldn't, and that's the regression you ship.

So we watch the execution as a stream, not just grade the result at the end, and we flag anomalies independently of pass/fail:

  • Unexpected system interruptions. A permissions dialog, a Sign in with Apple sheet, a software-update nag — anything XCUITest has to dismiss with an interruption monitor. If one fires when the scenario didn't expect it, that's an anomaly even if the test still passes.
  • Transition timing drift. A screen push that usually settles in 300ms suddenly taking 1.5s. The assertion may still pass on the wait, but the drift is a performance regression announcing itself early.
  • Retry-shaped behaviour. The agent tapping a control, seeing nothing happen, and tapping again. A double-tap that lands on a now-enabled button looks like success and hides a real responsiveness bug.
  • Layout anomalies. Content rendering off-screen, clipped, or overlapping — visible to a screenshot diff even when the element the agent wanted was still technically resolvable.

None of these fail the test on their own. All of them are worth surfacing. The point of watching rather than grading is that you catch the regression that's polite enough to still pass — which is exactly the one your users will find first.
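Three of the four anomaly classes above can be sketched as a pass over the run's event stream. The `Event` shape, the drift factor of 3, and the function name are assumptions for illustration; layout anomalies are left out because they need a screenshot comparison, not event data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str                 # "tap", "transition", or "interruption"
    target: str               # element identifier, screen name, or dialog text
    duration_ms: float = 0.0  # transitions only
    ui_changed: bool = True   # taps only: did the screen change afterwards?


def find_anomalies(
    events: list[Event],
    baseline_ms: dict[str, float],
    expected_interruptions: frozenset[str] = frozenset(),
    drift_factor: float = 3.0,
) -> list[str]:
    """Flag anomalies independently of whether the test passed."""
    anomalies: list[str] = []
    previous: Event | None = None
    for event in events:
        if event.kind == "interruption":
            if event.target not in expected_interruptions:
                anomalies.append(f"unexpected interruption: {event.target}")
        elif event.kind == "transition":
            baseline = baseline_ms.get(event.target)
            if baseline is not None and event.duration_ms > drift_factor * baseline:
                anomalies.append(
                    f"timing drift on {event.target}: "
                    f"{event.duration_ms:.0f}ms vs {baseline:.0f}ms baseline"
                )
        elif event.kind == "tap":
            # A second tap on the same control right after a tap that did nothing.
            if (
                previous is not None
                and previous.kind == "tap"
                and previous.target == event.target
                and not previous.ui_changed
            ):
                anomalies.append(f"retry-shaped tap on {event.target}")
        previous = event
    return anomalies
```

The output is a list of flags attached to the run, not a verdict: a green run with three flags is still green, but it is no longer clean.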

6. agent.yml is versioned, and the agent proposes the bumps

Everything the agent knows about your app — the screens, their identifiers, the named actions, the mock scenarios, the bounded step counts — lives in a declarative config. Call it agent.yml. The single most important decision you make about that file is to treat it as code: it lives in the repo, it's reviewed, and every change to it is a versioned, traceable diff.

This closes the loop with the anomaly watching. When the run surfaces an anomaly — an identifier that no longer resolves, a screen action that broke, a new interruption that needs handling — the response is not to hand-edit the config in a panic. The agent proposes a version bump: here is the screen that changed, here is the new identifier, here is the updated action, here is the golden trace that now passes. A human reviews it exactly like a code change, because it is one.

Versioning the config is what lets you answer the question every agentic suite eventually faces: did the suite change, or did the app? When agent.yml is versioned alongside the app, the diff tells you. When it's an opaque blob someone tweaks between runs, you're back to arguing about a red dashboard with no way to settle it. Pin the model version, pin the config version, and a regression has exactly one place left to hide — the app — which is the whole point.
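The "did the suite change, or did the app?" question reduces to comparing pins between the last green run and the first red one. A minimal sketch, assuming each run records a model version, the agent.yml version, and an app build; the field names and version strings are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunPins:
    model_version: str
    config_version: str  # version recorded in agent.yml
    app_build: str


def changed_pins(last_green: RunPins, first_red: RunPins) -> list[str]:
    """Names of the pins that differ between two runs, in declaration order."""
    return [
        f.name
        for f in fields(RunPins)
        if getattr(last_green, f.name) != getattr(first_red, f.name)
    ]
```

If the result is `["app_build"]`, the regression has one place to hide. If it lists two or three pins, the comparison is telling you the run was not controlled enough to attribute, which is worth knowing before anyone starts debugging.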

An agent config you edit by hand between runs is not a test artifact; it's a moving target. Versioned, reviewed, and proposed-by-the-agent is what makes a model or UI change a diff you can read instead of a mystery you re-run.

7. Export the xcresult — it's the richest artifact you have

When xcodebuild runs your tests it produces an .xcresult bundle, and most teams glance at the pass/fail summary and bin it. That bundle is the single best triage artifact Xcode gives you, and for an agentic suite it's close to essential. Inside it: per-step screenshots, the full test activity timeline, console and system logs, performance metrics, and any attachments you chose to capture — including, if you wire it up, the screenshot the agent saw and the natural-language plan it emitted at each step.

Capture it deliberately with xcodebuild test -resultBundlePath, and read it programmatically with xcrun xcresulttool, which emits JSON you can parse. Then feed it back into triage. This is where the iOS suite finally gets the thing the web suite had for free: a ground truth to check the agent's perception against. The screenshot from the bundle plus the plan the agent wrote is exactly the pair you need to split a perception error (the plan describes a screen the screenshot doesn't show) from a reasoning error (the plan reads the screen right but picks the wrong next move).
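A hedged sketch of the capture and read steps as harness code. `-resultBundlePath` is the flag named above. The `xcresulttool get --path ... --format json` form is the long-standing invocation, but newer Xcode releases mark it deprecated and may require `--legacy` or a newer subcommand, so treat the exact arguments as an assumption and check `xcrun xcresulttool --help` for your toolchain. The function names, scheme, and paths are hypothetical.

```python
from __future__ import annotations

import json
import subprocess


def build_test_command(scheme: str, destination: str, bundle_path: str) -> list[str]:
    """xcodebuild invocation that writes the result bundle to a known path."""
    return [
        "xcodebuild", "test",
        "-scheme", scheme,
        "-destination", destination,
        "-resultBundlePath", bundle_path,
    ]


def export_command(bundle_path: str, legacy: bool = False) -> list[str]:
    """xcresulttool invocation that prints the bundle's root object as JSON."""
    command = ["xcrun", "xcresulttool", "get", "--path", bundle_path, "--format", "json"]
    if legacy:
        # Assumption: some newer Xcode versions require this flag for this form.
        command.append("--legacy")
    return command


def read_bundle(bundle_path: str, legacy: bool = False) -> dict:
    """Run the export on a machine with Xcode installed and parse the JSON."""
    result = subprocess.run(
        export_command(bundle_path, legacy),
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

The JSON schema is deliberately not modelled here; it differs between Xcode versions, so the triage code that pairs each screenshot with the agent's plan is better written against the output of the toolchain you actually run.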

Export it, attach it to the run, and keep it long enough to investigate. The bundle is also how you reconstruct an anomaly after the fact — the slow transition, the unexpected dialog, the double-tap — without re-running and hoping it reproduces. On a non-deterministic actor, the run you can't reproduce is the run you're glad you recorded.

The .xcresult bundle is the iOS answer to "how do I tell perception from reasoning". Screenshot plus emitted plan, per step, already captured. Don't throw it away.

What this adds up to

Read the seven back and a pattern shows up. Only two of them — watching the run and reading the bundle — are about the agent at runtime. The other five are about the environment the agent runs in: clean state, mocked network, honest identifiers, maintained screen actions, versioned config. That ratio is the real lesson of doing agentic QA on iOS. The browser lets you put most of the engineering in the harness; iOS makes you put it in the app and the environment first.

It's the same discipline a good iOS test shop already applied to launch arguments, mock servers, and accessibility a decade ago — transposed to a non-deterministic actor and made a bit more demanding because the actor can't read a DOM as a safety net. The framework is mature. The agent is new. The engineering between them is what decides whether you get signal or a wall of red, and almost all of it is decided before the first tap.

Five of the seven levers fire before the agent makes its first tap. On iOS the agent is only as good as the state, the network, and the identifiers you hand it — the rest is just tapping.

Key takeaways

  • iOS gives the agent the accessibility tree, not a DOM. Accessibility identifiers are the contract, and a missing one is an accessibility bug, not flake to retry away.
  • Reset state deterministically from outside the test — launch flags, a clean simulator, or a gated in-app reset. A fresh XCUITest process is not a fresh app; the simulator remembers.
  • Mock at the HTTP boundary so the agent drives real screens against a deterministic backend, with fixtures keyed to named scenarios rather than endpoints.
  • Write screen actions down and have the agent propose updates when the UI changes — propose, never silently rewrite; a human approves the diff and a new golden trace is recorded.
  • Watch the run for anomalies that still pass, version agent.yml like code, and export the .xcresult bundle — screenshot plus emitted plan per step is how you split perception errors from reasoning errors.

FAQs

Does the agent need source access to the iOS app, or can it test a built binary?
It can drive a built binary through XCUITest, but it works far better with a test build that exposes the seams in this post — launch flags for state reset and network mocking, and accessibility identifiers on the controls. Those are app changes, not agent changes. Without them the agent is matching on visible labels and inheriting uncontrolled state, which is where most 'the agent is flaky on iOS' reports actually come from.
Why mock the network instead of using a staging backend?
A staging backend still has latency, auth expiry, rate limits, and data that drifts between runs — and every one of those becomes a test failure unrelated to the app. On an agentic suite that's especially damaging because you can no longer separate a perception or reasoning error from a backend hiccup. Mocking at the HTTP boundary keeps every screen and parse path real while making the bytes deterministic. We still run against staging in a separate, smaller integration pass.
What exactly is agent.yml, and why version it?
It's the declarative description of what the agent knows about your app — screens, accessibility identifiers, named screen actions, mock scenarios, step bounds. Versioning it in the repo, alongside the app, is what lets you answer 'did the suite change or did the app change?' from a diff instead of an argument. When an anomaly surfaces, the agent proposes the version bump and a human reviews it like any other code change.
How does exporting the xcresult bundle help triage?
The bundle holds per-step screenshots, the activity timeline, logs, and any attachments, including the screenshot the agent saw and the plan it emitted if you wire those up. That pair is exactly what splits a perception error (plan describes a screen the screenshot doesn't show) from a reasoning error (plan reads the screen right but picks the wrong move). Capture it with xcodebuild test -resultBundlePath and read it with xcrun xcresulttool. It also lets you reconstruct a non-reproducible anomaly without re-running and hoping.
Is this XCUITest-specific, or does it apply to other iOS test frameworks?
The seven levers are mostly framework-agnostic — state reset, HTTP mocking, accessibility identifiers, versioned config, and result-bundle triage apply whether you drive the app through XCUITest, Appium, or another tool. We default to XCUITest because it talks to the same accessibility tree the OS uses and produces the xcresult bundle natively, which removes a layer of translation between what the agent sees and what the system actually rendered.

About the author
Venkata Kari · Founder, GVK Technologies

Twenty years in QA leadership, most of it spent watching teams ship around a red dashboard. GVK Technologies builds and operates agentic test suites for product engineering teams — web, mobile, and API — see the case studies for measured runs against real apps.
