Espresso, UI Automator, and an Agent: Taming the Android Device Matrix

An iOS suite fights one runtime. An Android suite fights a few hundred combinations of OEM skin, API level, and screen density, where a green run on a Pixel turns red on a mid-range Samsung. The agentic pattern carries over from iOS, but fragmentation changes which levers matter most, and the hardest job becomes telling a device problem apart from a real bug.

1 June 2026 · 11 min read

TL;DR

Android's failure surface is the device matrix. The same build passes on one device and fails on another, and an agent that can't separate 'device-specific' from 'real bug' just multiplies the noise.
State leaks harder than on iOS. Instrumentation gives you a fresh process, not a fresh app — clear app data deterministically with pm clear or Android Test Orchestrator's per-test isolation, or you inherit the last test's session.
resource-id and content-description are the locator contract. They rot under refactors, they're often missing on custom views, and a missing one is the same gap a TalkBack user hits.
Mock at OkHttp/MockWebServer, version the agent's screen map, watch for OEM-specific interruptions, and keep the full artifact set — logcat, screenshots, and failure video are the ground truth a flaky matrix demands.

Why Android is a different fight from iOS

If you've read our piece on agentic iOS testing, the spine of this one will feel familiar: reset state, mock the boundary, keep the locators honest, maintain the screen actions, watch the run, version the config, export the artifacts. The framework changes — Espresso for in-process interaction, UI Automator for cross-app and system UI — but the discipline is the same.

What changes is the shape of the problem. iOS gives you a small, well-behaved set of devices and one vendor's idea of how the UI should work. Android gives you fragmentation: a long tail of OEM skins, a spread of API levels you still have to support, screen densities and aspect ratios that reflow your layout, and manufacturer customisations that change how a permission dialog looks or whether a back gesture does what you expect.

That fragmentation is the whole story. A scripted Android suite already spends most of its flake budget on 'works on the emulator, fails on the device' problems. Point an agent at that matrix without preparing for it and you don't get insight — you get the same noise, now narrated in fluent English. The work is in making the matrix legible, and almost all of it happens before the agent's first tap.

1. State leaks harder than you think

Android instrumentation runs your tests alongside the app, and engineers assume each test starts clean because the test runner looks like it isolates them. It doesn't isolate the thing that matters. Without Orchestrator, every test in a run shares one instrumentation process, so in-memory singletons carry over, and even a fresh process leaves on-disk state untouched: SharedPreferences, the app's SQLite or Room database, the internal files directory, the WebView cache, and account state all survive between tests on the same device. The first test signs in; the second launches into an authenticated home screen it was written to reach from a login flow, and fails on a screen it never planned for.

An agent handles this worse than a script, because it doesn't know what it inherited. It sees a home screen, reasons it must already be signed in, and confidently does the wrong next thing. You log that as a reasoning error and go hunting in the agent. The bug is upstream — a state leak wearing a reasoning error's clothes.

Reset deterministically, from outside the test body. In rough order of how often we reach for each:

Android Test Orchestrator. Run each test in its own instrumentation invocation and use clearPackageData so app data is wiped between tests. This is the cleanest per-test reset Android gives you natively, and it's where we start.
pm clear from the harness. adb shell pm clear your.package.id resets app data to a freshly-installed state. Slower than in-process resets, but unambiguous, and a good choice at the start of a shard.
A test-only reset entry point, gated to debug builds. A single intent or launch flag that returns the app to a known seed — signed out, empty database, default feature flags. This is the one the agent can invoke by name when you want it to reset and retry inside a logged budget rather than carry corruption forward into the next step.
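As a rough illustration of the pm clear option, a harness-side reset can be a few lines around adb. This is a minimal sketch, not our production harness: the function names and the injectable `run` parameter are hypothetical, and it assumes `pm clear` reports "Success" on stdout when it works.

```python
import subprocess
from typing import Callable, List, Sequence


def pm_clear_command(serial: str, package: str) -> List[str]:
    """Build the adb invocation that wipes one app's data on one device."""
    return ["adb", "-s", serial, "shell", "pm", "clear", package]


def _run_adb(cmd: Sequence[str]) -> str:
    """Default runner: execute the command and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def reset_app(serial: str, package: str,
              run: Callable[[Sequence[str]], str] = _run_adb) -> None:
    """Reset app data from outside the test body; fail loudly if it did not work."""
    output = run(pm_clear_command(serial, package))
    # Assumption: pm clear prints "Success" when the data was cleared.
    if "Success" not in output:
        raise RuntimeError(f"pm clear failed on {serial}: {output.strip()!r}")
```

Calling `reset_app(serial, "your.package.id")` at the start of each shard keeps the reset outside the test body, and the hard failure means a device that refuses to reset never silently feeds stale state to the agent.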

A fresh instrumentation process is not a fresh app. The device remembers SharedPreferences and the database across tests. Reset on purpose, or you will mislabel a state leak as an agent reasoning error and debug the wrong layer.

2. Mock at the HTTP boundary, scenario by scenario

You want the agent to drive the real app — real screens, real view models, real navigation — against a backend that never wavers. A live or staging backend brings latency, auth expiry, rate limits, and data that drifts run to run, and every one of those becomes a red test that has nothing to do with the app. On an agentic suite that's corrosive: you can no longer tell a perception or reasoning failure from a backend hiccup, because all three render as red.

On Android the clean seam is the HTTP client. Point the app's base URL at a MockWebServer instance, or register an OkHttp interceptor in debug builds, and serve canned responses from fixtures. The app makes real calls, parses real-shaped payloads, and renders exactly as it would in production. Only the bytes on the wire are deterministic.

The detail that makes this pay off is keying fixtures to named scenarios rather than to endpoints: 'empty feed', 'one unread message', 'payment declined', 'server returns 503'. The agent selects a scenario through an instrumentation argument, so it isn't just poking around — it's driving a specific, reproducible journey. When it fails, you know exactly which backend state it was standing on, on which device.
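One way to picture scenario-keyed fixtures from the harness side is the sketch below. Everything named here is illustrative: the scenario names come from the examples above, but the routes, status codes, fixture file names, and the `scenario` argument key are assumptions, not a real app's API.

```python
from typing import Dict, List, Tuple

Route = Tuple[str, str]    # (HTTP method, path)
Canned = Tuple[int, str]   # (status code, fixture file name)

# Hypothetical scenario table: each named scenario owns its own set of routes.
SCENARIOS: Dict[str, Dict[Route, Canned]] = {
    "empty_feed": {("GET", "/feed"): (200, "feed_empty.json")},
    "one_unread_message": {("GET", "/messages"): (200, "messages_one_unread.json")},
    "payment_declined": {("POST", "/payments"): (402, "payment_declined.json")},
    "server_503": {("GET", "/feed"): (503, "error_503.json")},
}


def canned_response(scenario: str, method: str, path: str) -> Canned:
    """Look up the fixture for one request under one named scenario."""
    routes = SCENARIOS[scenario]  # unknown scenario raises KeyError
    try:
        return routes[(method, path)]
    except KeyError:
        raise LookupError(
            f"scenario {scenario!r} has no fixture for {method} {path}"
        ) from None


def instrument_command(serial: str, scenario: str, test_package: str,
                       runner: str = "androidx.test.runner.AndroidJUnitRunner") -> List[str]:
    """Build an `am instrument` call that passes the scenario as an argument."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario}")
    return ["adb", "-s", serial, "shell", "am", "instrument", "-w",
            "-e", "scenario", scenario, f"{test_package}/{runner}"]
```

On the device side, the test would read the value back from the instrumentation arguments (for example via `InstrumentationRegistry.getArguments()`) and load the matching fixtures into the mock server. An unmatched request raising an error, rather than falling through to a default, is a deliberate choice: a journey that wanders off its scenario should fail visibly.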

Mock at the network client, not in the view models or repositories. The point is to keep every screen, parse path, and transition real — only the wire is canned. Stub higher and you stop testing the app you ship.

3. resource-id and content-description are the contract

Espresso can match on view properties and UI Automator matches on the accessibility-exposed tree, but in practice the stable handle an agent relies on is the view's resource-id, backed by content-description for the things screen readers need. That pair is the contract between the app and the test. When it's missing — and it's routinely missing on custom views, RecyclerView rows, and anything drawn with Canvas — the agent falls back to matching on visible text, which is localised, which changes for copy reasons, and which collides (three buttons reading 'OK').

This is the single most common root cause we find behind 'the agent is flaky on Android'. Usually it isn't the agent. It's that nobody owns the identifiers, so they rot: a refactor renames a view, a Compose migration drops the testTag, a new screen ships with nothing stable to grab.

Two habits keep them honest. First, identifiers are set deliberately — android:id and contentDescription on View-based UI, Modifier.testTag and semantics on Compose — and reviewed like any other interface, because that's what they are. Second, the agent audits them: before a run it walks the tree of each screen it knows and flags identifiers that are missing, duplicated, or no longer resolve. That audit is cheap and it catches the rot before it becomes a 2am misclick you waste a morning blaming on the model.
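The audit itself can be small. The sketch below assumes the harness already has a UI Automator hierarchy dump (the XML that `adb shell uiautomator dump` produces, with `resource-id`, `content-desc`, and `clickable` attributes on each node) and a set of identifiers the agent's screen map expects. The function name and the three report keys are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Dict, List, Set


def audit_screen(dump_xml: str, expected_ids: Set[str]) -> Dict[str, List[str]]:
    """Flag identifiers that are missing, duplicated, or no longer resolve."""
    nodes = list(ET.fromstring(dump_xml).iter("node"))
    ids = [n.get("resource-id") for n in nodes if n.get("resource-id")]
    counts = Counter(ids)

    # Clickable nodes with neither handle: the agent (and TalkBack) has nothing to grab.
    missing = [
        f'{n.get("class", "?")} {n.get("bounds", "")}'
        for n in nodes
        if n.get("clickable") == "true"
        and not n.get("resource-id")
        and not n.get("content-desc")
    ]
    return {
        "missing": missing,
        "duplicated": sorted(i for i, c in counts.items() if c > 1),
        "unresolved": sorted(expected_ids - set(ids)),
    }
```

Duplicates are not always a defect (list rows often share an id), so treat that bucket as a prompt for review rather than an automatic failure. On Compose, a testTag only shows up as a resource-id in this tree when the app opts in through the `testTagsAsResourceId` semantics property, so an "unresolved" tag may point at that setting rather than a dropped tag.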

A missing content-description isn't only a test problem — it's the exact gap a TalkBack user falls into. When the agent can't find a control, that's an accessibility signal to file against the app, not flake to retry away.

4. Screen actions are code the agent maintains

Above the raw identifiers sits a layer of screen actions — Android's version of the page object. 'Sign in as a returning user', 'open the third item in the list', 'pull to refresh and wait for the spinner to clear'. These wrap the taps, waits, and assertions that make up a meaningful unit of behaviour, so the agent reasons in user intent instead of re-deriving a gesture sequence from the screen on every run.

Writing them down rather than improvising each time is what buys determinism. An agent that re-plans how to sign in every run will eventually plan it wrong; an agent that calls a named, stable sign-in action does the same thing every time, and saves its judgement for the screen actually under test. It's the same instinct as pinning tool-call order: fix the parts that shouldn't vary.

But Android screens move — a field shifts, a step joins a flow, a Material component swaps for its Material 3 successor with different semantics — and the old action breaks. The agent detects the break, walks the new screen, and proposes an updated action: new gestures, new identifiers, new waits. It proposes; it does not silently rewrite. A human reviews the diff and a new golden trace is recorded before it ships. The agent does the tedious re-derivation; a person keeps the judgement.
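The "proposes, does not silently rewrite" rule is easy to encode if actions are immutable data. A minimal sketch, assuming a hypothetical `ScreenAction` shape with identifier-based steps; the real action format will differ from app to app.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Step:
    gesture: str     # e.g. "tap", "type", "wait_gone"
    target_id: str   # resource-id (or testTag) the gesture acts on


@dataclass(frozen=True)
class ScreenAction:
    name: str
    version: int
    steps: Tuple[Step, ...]


def broken_steps(action: ScreenAction, ids_on_screen: Set[str]) -> List[Step]:
    """Steps whose target no longer exists on the screen as it is today."""
    return [s for s in action.steps if s.target_id not in ids_on_screen]


def propose_update(action: ScreenAction, renames: Dict[str, str]) -> ScreenAction:
    """Return a new candidate action; the original stays untouched until reviewed."""
    new_steps = tuple(
        replace(s, target_id=renames.get(s.target_id, s.target_id))
        for s in action.steps
    )
    return replace(action, version=action.version + 1, steps=new_steps)
```

Because the dataclasses are frozen, the agent cannot mutate the action that is currently in use. The proposal is a separate object with a bumped version, which is what a reviewer diffs and what the new golden trace is recorded against.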

5. Watch the run — most of the matrix's lessons hide in passing tests

A pass/fail result tells you the assertion held. It says nothing about how the agent got there, and on a fragmented device matrix the most expensive problems are the ones that still pass. We watch the run as a stream and flag anomalies independently of the result:

OEM-specific interruptions. A manufacturer's battery-optimisation nag, a 'this app was built for an older version of Android' dialog, a vendor sign-in sheet — anything UI Automator has to dismiss that the scenario didn't anticipate. It may not fail the test; it's still an anomaly worth surfacing, often device-specific.
Async timing drift. A list that usually settles in 300 ms taking 1.5 s on a slower device. Espresso's idling resources may absorb the wait and pass, but the drift is a performance regression announcing itself on the low end first.
Retry-shaped behaviour. The agent tapping a control, seeing nothing happen, tapping again. A double-tap that lands on a now-ready button reads as success and hides a real responsiveness bug — frequently only on lower-tier hardware.
Device-specific layout breakage. Content clipped, off-screen, or overlapping at a density or aspect ratio you don't develop on. A screenshot diff catches it even when the target view still technically resolves.
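Two of these anomalies, timing drift and retry-shaped behaviour, can be flagged from the step log alone. The sketch below assumes a hypothetical per-step record with a settle time and a per-target baseline. The 3x drift factor is an arbitrary placeholder, not a recommended threshold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepRecord:
    action: str      # "tap", "scroll", ...
    target_id: str
    settle_ms: int   # time until the UI went idle after the action


def flag_anomalies(steps: List[StepRecord], baseline_ms: Dict[str, int],
                   drift_factor: float = 3.0) -> List[str]:
    """Flag anomalies independently of whether the test passed."""
    flags: List[str] = []
    for i, step in enumerate(steps):
        base = baseline_ms.get(step.target_id)
        if base is not None and step.settle_ms > drift_factor * base:
            flags.append(f"timing_drift:{step.target_id}:{step.settle_ms}ms vs {base}ms")
        # Two consecutive taps on the same control look like a retry.
        if i > 0:
            prev = steps[i - 1]
            if step.action == "tap" and prev.action == "tap" \
                    and prev.target_id == step.target_id:
                flags.append(f"retry_tap:{step.target_id}")
    return flags
```

The flags are attached to the run whether it ended green or red, which is the point: a passing run with a `retry_tap` on low-tier hardware is still worth a look.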

The hardest call on Android is triage: is this failure the device or the app? The agent's job is to make that separable, not to answer it alone. When the same scenario passes on twelve devices and fails on one, that's a device-specific signal — route it to whoever owns device coverage. When it fails the same way across the matrix, it's a real bug. Logging the device profile alongside the labelled failure class is what turns a flaky matrix into a sortable list instead of an argument.
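To make that separation concrete, here is a minimal sketch of a first-pass triage over one scenario's results across the matrix. The labels, the function name, and the 25% cut-off are assumptions for illustration; anything that does not clearly fit either pattern is handed to a person rather than guessed at.

```python
from collections import Counter
from typing import Dict, Optional


def triage(results: Dict[str, Optional[str]], device_share: float = 0.25) -> str:
    """results maps device profile -> labelled failure class, or None for a pass."""
    if not results:
        raise ValueError("no device results to triage")
    failures = {dev: cls for dev, cls in results.items() if cls is not None}
    if not failures:
        return "pass"
    classes = Counter(failures.values())
    # Same failure class on every device in the matrix: treat as a real bug.
    if len(failures) == len(results) and len(classes) == 1:
        return "app_bug"
    # A small minority of devices failing: route to device coverage.
    if len(failures) / len(results) <= device_share:
        return "device_specific"
    return "needs_human"
```

With twelve passing devices and one failure, the sketch returns `device_specific`; with the same failure class on every device it returns `app_bug`. The middle ground is deliberately not automated.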

6. Version the agent's map of the app

Everything the agent knows about your app — the screens, their identifiers, the named actions, the mock scenarios, the device matrix it runs against, the bounded step counts — lives in a declarative config. Treat that file as code. It lives in the repo, it's reviewed, and every change is a versioned, traceable diff.

This closes the loop with the anomaly watching. When a run surfaces a broken identifier, a changed screen, or a new OEM interruption that needs handling, the response is not a panicked hand-edit. The agent proposes a version bump — here's the screen that changed, here's the new identifier, here's the device it showed up on, here's the golden trace that now passes — and a human reviews it like any code change.

Versioning the config answers the question every agentic suite eventually faces: did the suite change, or did the app? When the config is versioned alongside the app, the diff tells you. Pin the model version, pin the config version, pin the device matrix, and a regression has exactly one place left to hide — which is the entire point.
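A sketch of what the pinning buys you: if every run records its pins in a manifest, comparing two manifests narrows where a regression can come from. The manifest keys below are hypothetical names for the pins described above.

```python
from typing import Dict, List

# Hypothetical manifest keys; one entry per thing that is pinned for a run.
PINNED_KEYS = ("app_build", "model_version", "config_version", "device_matrix")


def changed_pins(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """List which pinned inputs differ between two run manifests."""
    return [k for k in PINNED_KEYS if previous.get(k) != current.get(k)]


def where_to_look(previous: Dict[str, str], current: Dict[str, str]) -> str:
    """Turn the diff into a one-line hint for whoever triages the regression."""
    changed = changed_pins(previous, current)
    if not changed:
        return "nothing pinned changed: suspect flake or the device itself"
    if len(changed) == 1:
        return f"only {changed[0]} changed: start there"
    return "several pins changed at once: bisect " + ", ".join(changed)
```

The useful case is the middle one. When only `app_build` differs between a green run and a red one, the suite is ruled out before anyone opens the agent's logs.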

An agent config you tweak by hand between runs is a moving target, not a test artifact. Versioned, reviewed, and proposed-by-the-agent is what makes a model upgrade or a UI change a diff you can read instead of a mystery you re-run.

7. Keep the artifacts — a flaky matrix demands ground truth

Android won't hand you a single tidy bundle the way Xcode's xcresult does, so you have to assemble the ground truth yourself — and on a device matrix you cannot afford not to. The set we always capture: the instrumentation/Gradle test report, full logcat per test, per-step screenshots, and, where the runner supports it, failure video. Add the two artifacts that make agentic triage possible: the screenshot the agent saw and the natural-language plan it emitted at each step.

Capture screenshots and logs through the runner (the AndroidX test APIs write them into the run, and managed device or cloud-device runs collect them per device), then pull them off with adb and attach them to the run. The reason this matters more on Android than anywhere else: the failure you most need to understand is the one that only happened on one device, and you may not have that device on your desk. The recording is how you investigate without re-running and praying it reproduces.
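For the "pull them off with adb" step, the harness mostly needs a predictable layout per device and per test. A minimal sketch, assuming the test writes its screenshots to a known on-device directory; the remote path, the local layout, and the function name are placeholders.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Union


def artifact_plan(serial: str, test_name: str, remote_dir: str,
                  local_root: str = "artifacts") -> Dict[str, Union[str, List[List[str]]]]:
    """Commands to bracket one test so its logcat and files land in one folder."""
    local_dir = str(PurePosixPath(local_root) / serial / test_name)
    return {
        "local_dir": local_dir,
        # Clear the log buffer first so the dump covers only this test.
        "before_test": [["adb", "-s", serial, "logcat", "-c"]],
        "after_test": [
            # Dump the buffer and exit; the harness saves stdout as logcat.txt.
            ["adb", "-s", serial, "logcat", "-d"],
            # Copy screenshots (and video, if recorded) off the device.
            ["adb", "-s", serial, "pull", remote_dir, local_dir],
        ],
    }
```

Keying the local folder by device serial and test name means the one-device failure is a directory you can open, with the agent's screenshot and plan stored alongside the logcat for the same step.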

The screenshot-plus-plan pair is also how you split a perception error (the plan describes a screen the screenshot doesn't show) from a reasoning error (the plan reads the screen correctly but picks the wrong move). On a non-deterministic actor running across a non-uniform fleet, the run you can't reproduce is the run you're grateful you recorded.
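The split reduces to two comparisons once a reviewer (or the identifier audit) has labelled what the screenshot actually shows. A sketch under that assumption, with hypothetical label names:

```python
def classify_agent_failure(plan_screen: str, screenshot_screen: str,
                           chosen_action: str, expected_action: str) -> str:
    """Label a failed step using the plan the agent emitted and the screenshot it saw."""
    # The plan talks about a screen the screenshot does not show.
    if plan_screen != screenshot_screen:
        return "perception_error"
    # The screen was read correctly, but the move was wrong.
    if chosen_action != expected_action:
        return "reasoning_error"
    # The agent saw and chose correctly; look at the app, device, or backend.
    return "not_an_agent_error"
```

The order of the checks matters: a wrong action taken on a misread screen is counted as a perception error, because fixing the reasoning would not have helped.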

What this adds up to

Read the seven back and the Android-specific weight is obvious. The locator contract is heavier because there's no DOM and custom views routinely ship without identifiers. The artifact set is heavier because there's no native bundle and the failing device may be one you don't own. And the anomaly watching carries an extra job no other platform demands as sharply — separating a device problem from an app problem, every single run.

None of it is exotic. It's the discipline a good Android shop already applied to idling resources, device farms, and TalkBack a decade ago, transposed onto a non-deterministic actor and made more demanding by fragmentation. The framework is mature, the agent is new, and the engineering between them is what decides whether the device matrix becomes signal or just louder noise.

On iOS the agent fights one runtime. On Android it fights the device matrix — and the real skill isn't driving the app, it's telling a device problem apart from a real bug, every run.

Key takeaways

Android's failure surface is fragmentation. Log the device profile alongside a labelled failure class, or you cannot tell a device-specific failure from a real regression.
Reset state deterministically — Android Test Orchestrator's clearPackageData, pm clear, or a gated in-app reset. A fresh instrumentation process is not a fresh app.
resource-id and content-description are the locator contract; a missing one is also a TalkBack accessibility gap, not flake to retry away.
Mock at OkHttp/MockWebServer with fixtures keyed to named scenarios; have the agent propose screen-action updates and version its config like code.
There's no xcresult on Android — assemble the ground truth yourself (logcat, screenshots, failure video, agent screenshot + plan), because the failing device may be one you don't own.

FAQs

Espresso or UI Automator: which does the agent drive?

Both, for different jobs. Espresso gives fast, in-process interaction and synchronisation inside your app; UI Automator handles cross-app flows and system UI — permission dialogs, the notification shade, OEM sheets — that Espresso can't reach. The agent reasons in screen actions; underneath, those actions use whichever framework fits the surface. Most real journeys need both.

How do you stop the device matrix from exploding test time and cost?

You don't run every scenario on every device. We run the full suite on a small, representative core set and shard the long tail — running targeted scenarios on the devices most likely to expose density, API-level, or OEM-specific issues. The agent logging the device profile against each labelled failure is what lets you prune the matrix intelligently instead of guessing.

Does this work with Jetpack Compose, or only View-based UI?

Both. The locator contract just moves: android:id and contentDescription on Views, Modifier.testTag and semantics on Compose. The agent's identifier audit checks whichever applies. The most common Compose-specific failure we see is a migration that drops testTags a previous suite relied on — exactly the rot the pre-run audit is designed to catch.

Why mock the network instead of testing against staging?

Staging still has latency, auth expiry, rate limits, and drifting data, and each becomes a test failure unrelated to the app — especially damaging on an agentic suite where you then can't separate it from a perception or reasoning error. Mocking at the HTTP client keeps every screen and parse path real while making the wire deterministic. We still run a smaller, separate integration pass against staging.

We already have an Espresso suite. Do we throw it away?

No. The agentic layer wraps the same instrumentation your existing tests use and adds the labelling, state discipline, anomaly watching, and artifact capture on top. We almost always run the agentic and deterministic suites side by side rather than replacing the latter — the existing suite stays as fast regression coverage while the agent extends reach across the matrix.

Running an agent against your Android matrix?

We scope agentic Android QA the way we scope iOS — start with state, the network boundary, and the resource-id contract, version the agent config, and capture the full artifact set so a one-device failure is investigable. No retries-to-green theatre.

About the author: Venkata Kari · Founder, GVK Technologies

Twenty years in QA leadership, most of it spent watching teams ship around a red dashboard. GVK Technologies builds and operates agentic test suites for product engineering teams across web, mobile, and API — see the case studies for measured runs against real apps.