Engineering Notes · Agentic QA

The Self-Driving Codebase

An AI agent run as an engineering team: steered by ADRs, specs and design decisions, braked by evals, and stopped — on purpose — at the handful of calls a human should still be making. An honest field report from a year of building a cross-stack product this way, failures included.

6 June 2026 · 18 min read

TL;DR

For the better part of a year I've run an AI coding agent not as autocomplete but as a standing engineering team: it plans sprints, writes code, reviews its own work through adversarial sub-agents, ships to main, and writes its own retrospectives.
The shift that made the autonomy work was treating durable, version-controlled artefacts (ADRs, specs, a live-state index, retros) as the source of truth, not the conversation. The more intent lives in artefacts, the more autonomously it runs.
Autonomy is safe to exactly the degree that 'done' is pinned to gates the agent can't fake — explicit test counts, pre-registered thresholds, held-out benchmarks — plus fresh-context adversarial review that doesn't share the implementer's blind spot.
The human stays on the 5% that's genuinely theirs: design intent, product semantics, scope, anything irreversible. The system compounds because every mistake becomes a one-line rule and then a mechanical guard.

1. From an idea to an app: the sequencing bet

The most consequential product decision came before any of the agent stuff, and it was about ordering.

The obvious way to build a consumer app is to build the user-facing bit first — screens, onboarding, the things people will actually see — and bolt the hard technical work on afterwards. I did it the other way round, deliberately. The whole point of the product is an on-device prediction of a physiological time-series, made without sending anyone's data anywhere. If that prediction is rubbish, no amount of lovely UI saves you. And privacy isn't something you can retrofit once the architecture's set.

So the bet was simple: in a pre-launch phase with no real users, the thing worth iterating is the algorithm and the foundation, not the polish.

Figure: The sequencing bet: foundation now, redesign next, release last.

That ordering is why the project looks the way it does. The model pipeline got a dozen sprints of iteration whilst the app was still, frankly, a test rig for predictions. The privacy invariant got wired into the backend contracts and the review gates before there was a single screen worth showing anybody. Only then did the redesign happen, and it happened design-system-first: build the tokens, the primitives and the audit gates, then port the fourteen screens onto them, rather than going screen by screen.

Was it the right call? Mostly. The cost was real and a bit demoralising — for a long stretch there was nothing demo-able, which is psychologically hard and makes you want to declare the foundation 'good enough' before it is. But the payoff came when the redesign finally arrived: it was additive, visual work sitting on a stable base. Fourteen screens in a handful of sessions, because the hard parts were already settled. You can't compress a design sprint like that if the ground's still moving under it.

The transferable bit: decide your build order on purpose, and write the reasoning down somewhere the agent can read it. 'Foundation now, redesign next, release last' lived as a stated posture in the agent's memory, and every time a session got tempted to gold-plate a screen before the model was any good, that one line pulled it back on course.

2. The thesis: artefacts are the steering wheel

The naive way to use one of these agents is conversational. You describe what you want, it produces code, you review, repeat. Grand for a single function. It falls apart for a system, because the agent has no durable memory of why the system is the way it is. Every session starts cold. It re-derives context, guesses at your conventions, and merrily reinvents decisions you settled weeks ago.

The shift that made the autonomy work was to stop treating the conversation as the source of truth and start treating durable, version-controlled artefacts as the source of truth:

ADRs (Architecture Decision Records). One file per decision, with the decision, the reasoning, the alternatives, and the conditions under which it'd be overturned. The agent reads these before it touches an area, so it doesn't re-argue settled questions or quietly break them.
Specs. A canonical functionality-and-design document that the sprint plans are derived from.
A live-state index. One root file that acts as the agent's 'what's true right now' map. It points at the authoritative per-subsystem context rather than duplicating it, and records what's broken, what's next, and what's been deferred on purpose.
Retrospectives. One per sprint. The durable record of what shipped and what got learnt.
Design decisions. Captured as locked specs and a token system, so 'make it match the design' has a concrete, testable referent.

The agent doesn't drive on vibes. It drives on these. The conversation becomes the steering input for whatever's next; the artefacts are the road. When I want it to change direction, I update an artefact (or answer a decision that then becomes one), and the change propagates, because the agent's reading the artefact, not scrolling back through chat.

That's really the whole trick. The more of your intent lives in durable artefacts, the more autonomously the thing can run, because it can rebuild context without you.

3. CoALA: the memory architecture that made it legible

The framework that best describes what's going on is CoALA (Cognitive Architectures for Language Agents), which splits an agent's memory into four kinds: working, episodic, semantic and procedural. Mapping them onto real files in the repo turned an abstract idea into something operational.

Figure: Four kinds of agent memory, each mapped to a real file in the repo.

The taxonomy is fine, but the genuinely useful part is the promotion pipeline between the layers.

Figure: The promotion pipeline: episode → rule → mechanical guard.

A bug caught once is an episode. If it shows up again, the retro promotes it into a one-line rule in semantic memory. If it's structural enough, it gets baked into procedural memory: a skill step or a hook that makes the mistake mechanically impossible next time.

A test-harness trap. Importing a particular device-storage-backed store into a previously-pure test module broke the test at load time — a silent module-load crash, not an assertion failure, so it read as a baffling error. First time round, an investigation loop. After the retro promoted it to a Known Pitfall, the next sprint's agent walked straight into the same situation, recognised it immediately, applied the documented one-line mock, and carried on. No investigation at all.

A token system that lied to the renderer. A design-token file was exporting CSS font stacks, which the mobile framework can't parse and silently swaps for the system font, no error raised. The fix wasn't just to correct it; it was to add a source-level audit gate, a test that scans the whole tree for the anti-pattern, so the class of bug can't quietly come back. Episodic lesson, procedural guard.
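A minimal sketch of what such an audit gate could check, assuming font tokens are plain name-to-string pairs. The token names and the comma heuristic are assumptions for illustration; the gate described above scans source files across the tree, which this sketch does not attempt.

```typescript
// Heuristic (assumption): a comma in a font token means a CSS-style fallback
// stack rather than a single loadable font family.
function findCssFontStacks(fontTokens: Record<string, string>): string[] {
  return Object.keys(fontTokens)
    .filter((name) => (fontTokens[name] ?? "").indexOf(",") !== -1)
    .sort();
}

// Hypothetical tokens, not the project's real ones.
const fontTokens: Record<string, string> = {
  body: "Inter, system-ui, sans-serif", // CSS stack: the anti-pattern
  heading: "Inter-Bold",
  mono: "JetBrainsMono-Regular",
};

// A test would fail the build whenever this list is non-empty.
const offenders = findCssFontStacks(fontTokens);
```

Run as a test, a check like this turns a silent rendering fallback into a loud build failure.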

Each promotion makes the agent permanently better at the system level rather than just within a session. That's the difference between an agent that's handy today and a process that compounds. The 'Known Pitfalls' file is a couple of dozen lines now, and every line is a scar: something that cost time once and doesn't get to cost it again.
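The promotion decision can be sketched as a small pure function. The layer names come from CoALA; the field names, the recurrence threshold of two, and the choice to let a guardable lesson skip straight to procedural memory are all assumptions made for this sketch.

```typescript
type MemoryLayer = "episodic" | "semantic" | "procedural";

interface Lesson {
  summary: string;
  occurrences: number; // sprints in which the problem has shown up
  guardable: boolean; // can a hook, skill step or test make it impossible?
}

// Decide where a lesson should live after a retro.
function promote(lesson: Lesson): MemoryLayer {
  if (lesson.guardable) return "procedural"; // bake it into a mechanical guard
  if (lesson.occurrences >= 2) return "semantic"; // recurring: one-line rule
  return "episodic"; // seen once: stays in the retro
}

// Hypothetical lessons for illustration.
const seenOnce = promote({ summary: "one-off tooling hiccup", occurrences: 1, guardable: false });
const recurring = promote({ summary: "same trap, second sprint", occurrences: 2, guardable: false });
const structural = promote({ summary: "anti-pattern a test can scan for", occurrences: 1, guardable: true });
```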

4. CDLC: context as a lifecycle, not a dump

The second framework I'll call the Context Development Lifecycle: treating the agent's context the way you'd treat code, as something you generate, evaluate, distribute and observe on a loop, rather than something you stuff into a prompt and hope about.

Figure: Context as a lifecycle: generate, evaluate, distribute, observe, then round again.

Two concrete mechanisms made this real: checkpointing and self-orientation. A hook writes a session-state checkpoint — last few commits, working-tree status, the current plan header — every time the agent's turn ends. Before a long task, or after the context window compacts, the agent reads that checkpoint to get its bearings instead of re-reading everything. Working memory, externalised.

And the shipping skill's very first step is, literally, 'orient yourself': read the checkpoint, confirm the task matches what was actually asked, and only then start typing. It's the cheapest possible insurance against an agent confidently beavering away on the wrong thing after a context reset.

The upshot is that a multi-hour, multi-task session can survive a compaction without losing the plot, because the plot lives in files rather than in the volatile window.
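A rough sketch of the checkpoint-then-orient idea, assuming the checkpoint holds the three things named above. The type, the function names, the sample commit and sprint label, and the rule of matching a task id against the plan header are hypothetical.

```typescript
interface SessionCheckpoint {
  recentCommits: string[];
  workingTree: "clean" | "dirty";
  planHeader: string;
}

// Serialise the checkpoint to plain text the next session can read cheaply.
function renderCheckpoint(cp: SessionCheckpoint): string {
  return [
    `Plan: ${cp.planHeader}`,
    `Working tree: ${cp.workingTree}`,
    "Recent commits:",
    ...cp.recentCommits.map((commit) => `  ${commit}`),
  ].join("\n");
}

// "Orient yourself": refuse to start if the request doesn't match the plan on file.
function orient(cp: SessionCheckpoint, requestedTaskId: string): "proceed" | "stop-and-ask" {
  return cp.planHeader.indexOf(requestedTaskId) !== -1 ? "proceed" : "stop-and-ask";
}

// Illustrative values only.
const checkpoint: SessionCheckpoint = {
  recentCommits: ["a1b2c3d fix: reconcile rollback against the database"],
  workingTree: "clean",
  planHeader: "Sprint 12 / T3: derive insights statistic from logged data",
};

const rendered = renderCheckpoint(checkpoint);
const onPlan = orient(checkpoint, "T3");
const offPlan = orient(checkpoint, "T7");
```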

5. The team: personas and the adversarial review loop

I don't run one agent. I run a small scrum team of personas, each with a narrow remit and its own brief:

a manager (sprint shape, the dependency graph, the risk register),
a tech director (architecture, what's V1 versus later, what's deferred),
a senior developer (implementation blueprints, library calls, effort sizing),
a QA lead (acceptance criteria, the test pyramid, regression risk),
a designer (visual fidelity, token hygiene, the design contract),
and a review critic that runs last, specifically to find where the others contradict each other.

The most important choice here is that the reviewers run with fresh context. When a task's implemented, the diff goes to a code-reviewer sub-agent that has not seen the implementer's reasoning, only the code and the rules. For anything touching persistence, async or error handling, a second sub-agent runs alongside it: a silent-failure hunter whose entire job is to find places where an error gets swallowed.

Figure: Fresh-context, adversarial review: the single highest-impact part of the setup.

This separation isn't ceremony. It's the single highest-impact thing in the whole setup, and here's the evidence: on more than one occasion the silent-failure hunter caught a new silent failure that the fix itself had just introduced.

The canonical case: a task was hardening an optimistic UI update so a failed write wouldn't vanish without trace. The first attempt rolled the UI back to a value captured before the write. The hunter flagged it as high-severity. Under interleaved updates, that captured value could be stale, so the 'fix' could now leave the UI quietly disagreeing with the database — a fresh silent failure, born inside a silent-failure fix. The rework reconciled against the database instead, the actual source of truth, and it came out both more correct and simpler than the first go.
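The interleaving can be reproduced in a few lines. This is a synchronous toy model, assuming a one-value in-memory "database" and a one-value UI; FakeDb, Ui and runInterleaved are invented for the illustration and stand in for the real async store and viewmodel.

```typescript
// In-memory stand-ins (assumptions): a one-value "database" and a one-value UI.
class FakeDb {
  private value: number;
  constructor(initial: number) {
    this.value = initial;
  }
  read(): number {
    return this.value;
  }
  // `succeed` simulates whether the write lands; a failed write changes nothing.
  write(next: number, succeed: boolean): boolean {
    if (succeed) this.value = next;
    return succeed;
  }
}

interface Ui {
  shown: number;
}

type Strategy = "captured-value" | "reconcile";

// Two overlapping optimistic updates: A starts, B starts, B succeeds, A fails.
function runInterleaved(strategy: Strategy): { ui: number; db: number } {
  const db = new FakeDb(1);
  const ui: Ui = { shown: db.read() };

  const capturedByA = ui.shown; // what the UI showed before A began
  ui.shown = 2; // A's optimistic value
  ui.shown = 3; // B's optimistic value, applied on top of A's
  db.write(3, true); // B's write lands

  const aLanded = db.write(2, false); // A's write fails
  if (!aLanded) {
    // The rollback choice is the whole difference between the two strategies.
    ui.shown = strategy === "captured-value" ? capturedByA : db.read();
  }
  return { ui: ui.shown, db: db.read() };
}

const naive = runInterleaved("captured-value"); // UI rolls back to a stale value
const fixed = runInterleaved("reconcile"); // UI takes whatever the database holds
```

With the captured value the UI ends up showing 1 while the database holds 3; reconciling leaves both at 3.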

An agent reviewing its own work in its own context would never have spotted that, because it shared the implementer's blind spot. Fresh-context, adversarial review is how you get the agent to genuinely take itself to task.

6. A worked example: one task, start to finish

Abstractions are cheap, so here's the whole machine turning over on one real (anonymised) task.

The carry. A retro's risk register had an honesty problem on it. A statistic on the insights screen was really the user's self-reported setup estimate, but it was presented as though it had been computed from their logged data. As they logged more, the two would drift apart, and the screen would quietly start lying.

The human decision, up front. Before any code, the agent put a structured choice in front of me: derive the statistic from logged data now (more work, accurate), add a schema column plus a migration (cleanest long-term, biggest change), or defer and just label the estimate honestly? It gave a recommendation and the cost of each. I picked 'derive now', and that answer became part of the plan — an artefact, not a passing remark in chat.

Implementation. The agent pulled out a pure module: run-detection over the logged days, stitching runs across month boundaries, and some careful manual month arithmetic specifically to dodge a date-overflow pitfall that was already sitting in the Known Pitfalls register from an earlier sprint. The module had type-only domain imports so it could be unit-tested without dragging the database into the test runner. The viewmodel got thin wiring on top.
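A minimal sketch of the two pure pieces, under stated assumptions. The exact pitfall isn't spelt out above; a common one is that JavaScript's Date rolls an out-of-range day into the next month (31 January moved to February lands in March), so this sketch does month arithmetic on plain integers instead. The names and the day-number representation are hypothetical.

```typescript
interface YearMonth {
  year: number;
  month: number; // 1..12
}

// Month arithmetic on plain integers, so there is no day-of-month to overflow.
function addMonths(ym: YearMonth, delta: number): YearMonth {
  const index = ym.year * 12 + (ym.month - 1) + delta;
  const year = Math.floor(index / 12);
  return { year, month: index - year * 12 + 1 };
}

// Group logged days into inclusive [start, end] runs of consecutive days.
// Days are integer day numbers (for example, days since a fixed epoch), so a run
// that crosses a month boundary is stitched together with no special casing.
function detectRuns(days: number[]): Array<[number, number]> {
  const sorted = days.slice().sort((a, b) => a - b);
  const runs: Array<[number, number]> = [];
  for (const day of sorted) {
    const last = runs[runs.length - 1];
    if (last !== undefined && day - last[1] <= 1) {
      last[1] = day; // same or next day: extend the current run
    } else {
      runs.push([day, day]); // gap: start a new run
    }
  }
  return runs;
}

const afterDecember = addMonths({ year: 2024, month: 12 }, 1);
const beforeJanuary = addMonths({ year: 2024, month: 1 }, -1);
const runs = detectRuns([32, 30, 31, 40]);
```

Because nothing here imports a database or a framework, it can be unit-tested in a headless runner as-is.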

The gates. Full suite green, types clean, lint clean. Code reviewer: clean. Silent-failure hunter: one high-severity finding. A failed data read collapsed straight to 'fall back to the setup estimate', which meant a load failure would get rendered with a 'from your setup' label. The honesty feature would end up lying on error. The fix was a distinct error state the UI renders as silence rather than a false claim of provenance.
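One way to make the false provenance claim unrepresentable is a discriminated union with a distinct failure state. This is a sketch, not the shipped code: the state names, the "from your logs" label and the use of a mean as the derived statistic are assumptions; only the 'from your setup' label is taken from the account above.

```typescript
type StatState =
  | { kind: "derived"; value: number }
  | { kind: "setup-estimate"; value: number }
  | { kind: "load-failed" };

function loadStat(readLoggedValues: () => number[], setupEstimate: number): StatState {
  let logged: number[];
  try {
    logged = readLoggedValues();
  } catch {
    // A failed read is its own state. It must not masquerade as "no logs yet".
    return { kind: "load-failed" };
  }
  if (logged.length === 0) {
    return { kind: "setup-estimate", value: setupEstimate };
  }
  // Stand-in statistic: a plain mean of the logged values.
  const mean = logged.reduce((sum, v) => sum + v, 0) / logged.length;
  return { kind: "derived", value: mean };
}

// The label the UI shows next to the number; null means "render nothing".
function provenanceLabel(state: StatState): string | null {
  switch (state.kind) {
    case "derived":
      return "from your logs";
    case "setup-estimate":
      return "from your setup";
    case "load-failed":
      return null; // silence rather than a false claim of provenance
  }
}

const derived = loadStat(() => [28, 30], 29);
const noLogs = loadStat(() => [], 29);
const failed = loadStat(() => {
  throw new Error("read failed");
}, 29);
```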

The aftermath. At the next retro, one persona confidently asserted the derivation had a 'sparse-logging gap' — that a user with months between events would lose data. The review critic didn't take it at face value. It went and read the loading code and disproved the claim outright: the scenario described couldn't happen, because distinct events are separate runs and the date-range assembly already covered every valid one. A plausible-sounding architectural worry, killed by reading the code, before it could pollute the register.

That's the loop in miniature. A human decision at the boundary, a pure and testable implementation, the deterministic gates, an adversarial reviewer catching a real self-inflicted bug, and a critic keeping the next sprint's planning honest. No single step is clever. It's the stacking of all of them that makes the autonomy safe to leave running.

7. Read the source before you name the fix

If I had to boil the most-repeated lesson down to one line, that'd be it. Time and again the plan's description of a task turned out to be wrong, and a few minutes reading the actual code put it right before a single line got written:

A task framed as 'add the missing variant' needed no variant at all. The real bug was a bare empty catch quietly swallowing a font-load failure. The named deliverable was a red herring; the genuine risk was the silent catch.
A task framed as 'fix the icon colour bug' was chasing a non-bug. The suspicious pattern was the shared, shipped convention across seventeen other icons. Changing it would have regressed all of them on an unverified hunch.
A task framed as 'which shade should this be?' had exactly one valid answer once you read the palette, because the palette was deliberately partial and the obvious shade failed the contrast gate.

The discipline that falls out of this is that the plan is a hypothesis, not an order. A good agent treats a task description as something to check against the code, not execute on faith. It's also why the human-authored spec and the machine-read code have to live side by side. The spec says what we want; the code says what's true; and the gap between them is precisely where the bugs and the rework hide.

8. The self-driving loop

What emerged is a closed loop, and it mostly runs itself.

Figure: The loop is the durable engine; individual sprints are just turns of the crank.

A retro closes a sprint and emits a carried-risk register: the honest backlog of what's been deferred, each item with an owner and a trigger condition. A planning skill turns that into an execution plan with tasks, a dependency graph, effort sizes, and the sequencing that keeps coupled tasks from treading on each other. A shipping skill runs each task through a fixed pipeline — orient, implement, run the full suite, type-check, spawn the reviewer (and the silent-failure hunter when it's warranted), apply the real findings, re-run, conventional-commit, push, update state. Then the next retro mines the sprint's commits for new pitfall classes and promotes them.
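The fixed pipeline can be sketched as an ordered list of steps that halts at the first failure, so nothing downstream (commit, push) runs on a red gate. The step names echo the ones listed above; the types and the runner are assumptions, and the real steps would shell out to tools rather than return canned results.

```typescript
type StepResult = { status: "pass" } | { status: "fail"; reason: string };

interface Step {
  name: string;
  run: () => StepResult;
}

interface PipelineOutcome {
  completed: string[];
  failedAt: string | null;
  reason: string | null;
}

// Run steps in order and stop at the first failure.
function runPipeline(steps: Step[]): PipelineOutcome {
  const completed: string[] = [];
  for (const step of steps) {
    const result = step.run();
    if (result.status === "fail") {
      return { completed, failedAt: step.name, reason: result.reason };
    }
    completed.push(step.name);
  }
  return { completed, failedAt: null, reason: null };
}

const pass = (): StepResult => ({ status: "pass" });
let pushed = false;

// Canned results for illustration: the suite fails, so push must never run.
const outcome = runPipeline([
  { name: "orient", run: pass },
  { name: "implement", run: pass },
  { name: "full-suite", run: () => ({ status: "fail", reason: "1750 of 1751 passing" }) },
  {
    name: "push",
    run: () => {
      pushed = true;
      return pass();
    },
  },
]);
```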

A 'gap-closure sprint', where you take a retro's register and shut all of it, became a recognisable, efficient shape: pre-scoped carries, a handful of up-front human decisions, mostly small tasks, and a fair bit of throughput in a single session.

One non-obvious efficiency: collect the human decisions up front. For a sprint that's closing a known register, the only real ambiguity is design and product intent at the edges. Surfacing those as one batch of questions before the agent starts coding gets rid of the mid-sprint interruptions that otherwise stall the individual tasks.

9. Where the human still sits in the chair

None of this scaffolding is about removing the human. It's about removing the human from the 95% of decisions that are mechanical, and pointing their attention at the 5% that are genuinely theirs.

Figure: The agent decides the reversible 95%; the human owns the 5% that's genuinely theirs.

What the agent decides on its own: implementation patterns, test design, refactors, which reviewer findings to apply, naming, library calls inside an established convention — anything reversible and internal.

What it stops for, every single time:

Design intent. 'Should this be sage or lilac?' The palette has constraints the agent can enforce (contrast, tokenisation), but which colour expresses the brand is a human's call. It hands me the options, the contrast maths, and a recommendation; I pick.
Product semantics. 'Should this metric be derived from logged data, computed via a schema migration, or left as the setup estimate?' Each is defensible, and the choice changes the product's behaviour and its data model. Not the agent's to make quietly.
Scope and ambition. 'Close the whole register, or just the launch-blockers?' That's a cost/value judgement, and it's mine.
Anything irreversible or outward-facing. Publishing, deleting, sending data to a third party, anything that's a pain to walk back.

The mechanism matters as much as the principle. These come over as structured choices with a recommendation and the trade-offs, not open-ended 'what do you fancy?' prompts. 'Here are three options, here's the one I'd pick and why, here's what each costs' respects the human's time and keeps the decision sharp. A good autonomous agent isn't one that never asks. It's one that asks rarely and well, and never about something it could have checked itself.

There's a failure mode on the other side too: an agent that asks too much is about as useless as one that asks too little. The calibration — this one's genuinely yours; that one I'll handle and tell you what I did — is the actual skill, and it took a while to get right.
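A structured choice can be given a shape and a cheap validity check, so an open-ended question never reaches the human. The field names and the validation rules are assumptions for this sketch; the three options mirror the worked example in section 6, and which one the agent recommended is not stated there, so the value below is illustrative.

```typescript
interface DecisionOption {
  id: string;
  summary: string;
  cost: string;
}

interface DecisionRequest {
  question: string;
  options: DecisionOption[];
  recommended: string; // id of the option the agent would pick
  rationale: string;
}

// Returns the reasons a request is not yet fit to put in front of a human.
function problemsWith(request: DecisionRequest): string[] {
  const problems: string[] = [];
  if (request.options.length < 2) {
    problems.push("offer at least two concrete options");
  }
  if (!request.options.some((o) => o.id === request.recommended)) {
    problems.push("recommend one of the listed options");
  }
  if (request.rationale.trim() === "") {
    problems.push("say why the recommendation is preferred");
  }
  if (request.options.some((o) => o.cost.trim() === "")) {
    problems.push("state what each option costs");
  }
  return problems;
}

const structured: DecisionRequest = {
  question: "How should the insights statistic be sourced?",
  options: [
    { id: "derive-now", summary: "Derive from logged data", cost: "more work, accurate" },
    { id: "migrate", summary: "Schema column plus migration", cost: "biggest change, cleanest long-term" },
    { id: "defer", summary: "Label the estimate honestly", cost: "smallest change, stays an estimate" },
  ],
  recommended: "derive-now", // illustrative
  rationale: "Accurate without a schema change",
};

const openEnded: DecisionRequest = {
  question: "What do you fancy?",
  options: [],
  recommended: "",
  rationale: "",
};
```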

10. How 'done' gets measured: the eval stack

This is the question that decides whether any of the above is trustworthy, because the thing writing the code is also the thing telling you it works. The answer is that 'done' is never the agent's opinion.

It's a stack of gates the agent can't talk its way past — four layers, every one of them green or it isn't done.

Figure: Done = every layer green. The agent can't move a gate it misses.

Deterministic (the floor). The full suite with the count spelt out — not 'tests pass' but '1751 of 1751 passing' — strict type-checking at zero errors, lint at zero, a hard bundle-size ceiling, localisation key-parity so a half-translated feature can't sneak through, and a cold-start performance gate for the on-device model on low-end hardware.
Semantic (the judgement layer). A code reviewer on every code-touching task, filtered against the project's own rules, and a silent-failure hunter in parallel for any persistence, async or error path — including, crucially, a re-review of the fix itself.
Domain (for the ML). A held-out benchmark on data the model has never seen, reported as a concrete error metric that every future retrain must beat to ship; a privacy budget treated as a first-class, ADR-governed quantity; and a cross-platform parity check that demands a byte-identical prediction on both mobile platforms.
Visual (what the test runner can't do). For things a headless runner genuinely can't judge — does a hand-drawn icon look right at 24 device-pixels? — the agent drives a real browser, screenshots the rendered thing, and the screenshot goes into the repo as durable evidence.

The thread running through all four layers is the same. Evals are the brakes, and the agent isn't allowed near the brake line. The autonomy is safe to exactly the degree that 'done' is pinned to gates the agent can't fake.

A green compile is never accepted as done. The number gets pasted, every time — and below the floor, nothing is 'done', and that's the end of it.
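The deterministic floor reduces to a function over numbers rather than opinions. A minimal sketch, assuming a report object with these hypothetical field names; the bundle figures are made up, and the real gates also cover localisation parity and cold-start performance, which are left out here.

```typescript
interface GateReport {
  testsPassed: number;
  testsTotal: number;
  typeErrors: number;
  lintErrors: number;
  bundleBytes: number;
  bundleCeilingBytes: number;
}

// List every gate that is not green, with the number spelt out.
function gateFailures(r: GateReport): string[] {
  const failures: string[] = [];
  // Zero tests run is a failure too: "nothing failed" is not "everything passed".
  if (r.testsTotal === 0 || r.testsPassed !== r.testsTotal) {
    failures.push(`tests: ${r.testsPassed} of ${r.testsTotal} passing`);
  }
  if (r.typeErrors !== 0) failures.push(`types: ${r.typeErrors} errors`);
  if (r.lintErrors !== 0) failures.push(`lint: ${r.lintErrors} errors`);
  if (r.bundleBytes > r.bundleCeilingBytes) {
    failures.push(`bundle: ${r.bundleBytes} bytes over ceiling ${r.bundleCeilingBytes}`);
  }
  return failures;
}

function isDone(r: GateReport): boolean {
  return gateFailures(r).length === 0;
}

const green: GateReport = {
  testsPassed: 1751,
  testsTotal: 1751,
  typeErrors: 0,
  lintErrors: 0,
  bundleBytes: 900000, // hypothetical
  bundleCeilingBytes: 1000000, // hypothetical
};
const oneRed: GateReport = { ...green, testsPassed: 1750 };
```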

11. The model pipeline: how it actually evolved

The model pipeline is where the most interesting technical evolution happened, and the most instructive failure with it. Worth its own section.

Figure: Conservative on purpose: the boring model stays as cold-start path and fallback.

The progression was deliberate and on the conservative side. Start with something boring and reliable. The first predictor was a classical, rule-based statistical filter. Not fancy, but robust, needs almost no data to cold-start, and never embarrasses you.

Earn the right to add complexity. Only once the foundation was solid did a learned sequence model go in, trained with a differential-privacy mechanism so the model itself carries a formal privacy guarantee, in keeping with the no-data-leaves-the-device rule.

Iterate the learned model against a held-out gate. A retrain tightened the privacy budget and rebalanced the training cohorts, and, crucially, it had to beat a frozen held-out benchmark to earn promotion. It did.

Keep the old model. This is the decision I'm most pleased with. The learned model didn't replace the filter. The filter stays on as the cold-start path (when there's too little data), the fallback path (when the learned model errors out), and the safe default. The system is both/and, behind a feature flag, rather than a triumphant either/or. When the prediction is the product, conservative beats clever.
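The both/and arrangement can be sketched as a small selector that records why it fell back, so the fallback itself never becomes a silent one. Everything named here is an assumption: the option fields, the source tags, and the stand-in predictors, which are neither the real filter nor the real model.

```typescript
type Predictor = (history: number[]) => number;

type Source =
  | "learned"
  | "filter:flag-off"
  | "filter:cold-start"
  | "filter:learned-error";

interface Prediction {
  value: number;
  source: Source;
}

interface Options {
  learnedEnabled: boolean; // the feature flag
  minHistory: number; // below this, treat the user as cold-start
}

function predict(history: number[], filter: Predictor, learned: Predictor, options: Options): Prediction {
  if (!options.learnedEnabled) {
    return { value: filter(history), source: "filter:flag-off" };
  }
  if (history.length < options.minHistory) {
    return { value: filter(history), source: "filter:cold-start" };
  }
  try {
    return { value: learned(history), source: "learned" };
  } catch {
    // Fall back, but say so in the result.
    return { value: filter(history), source: "filter:learned-error" };
  }
}

// Stand-ins for illustration only.
const mean: Predictor = (h) => (h.length === 0 ? 0 : h.reduce((a, b) => a + b, 0) / h.length);
const learnedOk: Predictor = (h) => mean(h) + 1;
const learnedBroken: Predictor = () => {
  throw new Error("model failed to load");
};
const flagOn: Options = { learnedEnabled: true, minHistory: 3 };

const viaLearned = predict([1, 2, 3], mean, learnedOk, flagOn);
const viaFallback = predict([1, 2, 3], mean, learnedBroken, flagOn);
const coldStart = predict([1], mean, learnedOk, flagOn);
const flagOff = predict([1, 2, 3], mean, learnedOk, { learnedEnabled: false, minHistory: 3 });
```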

The sprint that missed its own gate. One sprint went after a small accuracy gap between user cohorts. The retrained model beat the previous one on the headline metric, but fell short of a pre-registered ship threshold by about four-hundredths of a unit. Because the gate had been written down before the experiment, the answer wasn't 'eh, near enough, ship it'. It was a documented decision not to promote on accuracy, and to go after the gap at the UX layer instead. An eval you can fail is the only kind worth having; a gate you move when you miss it isn't a gate at all.
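A pre-registered gate is easy to express as data fixed before the experiment and a decision function that cannot be argued with afterwards. The numbers below are invented to reproduce the shape of the story (beats the previous model, misses the threshold by 0.04); the metric name and the assumption that lower is better are also mine.

```typescript
interface ShipGate {
  readonly metric: string;
  readonly maxError: number; // pre-registered: written down before the experiment
}

interface PromotionDecision {
  beatsPrevious: boolean;
  meetsGate: boolean;
  promote: boolean;
}

// Assumes an error metric where lower is better.
function decidePromotion(candidateError: number, previousError: number, gate: ShipGate): PromotionDecision {
  const beatsPrevious = candidateError < previousError;
  const meetsGate = candidateError <= gate.maxError;
  // Beating the old model is necessary but not sufficient.
  return { beatsPrevious, meetsGate, promote: beatsPrevious && meetsGate };
}

// Hypothetical values; the gate object is frozen so the code can't nudge it later.
const gate: ShipGate = Object.freeze({ metric: "held-out-error", maxError: 1.2 });
const decision = decidePromotion(1.24, 1.3, gate);
```

Here the candidate improves on the previous model yet still isn't promoted, which is the outcome the sprint documented.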

The sprint a diagnostic killed before it started. A second model got proposed — a learned predictor for a secondary physiological event. Before committing a multi-week sprint, the agent ran a cheap, falsifiable diagnostic on the available labels. The finding: the public labels for that target were about 99% derivable from a trivial arithmetic combination of fields the system already had. Training a model to predict them would mostly be training it to relearn arithmetic. The sprint got killed at the diagnostic stage, before a line of model code was written. Run the cheap experiment that might kill the expensive one first.
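The diagnostic amounts to asking what fraction of labels a trivial rule already reproduces. A sketch on synthetic data, assuming hypothetical fields fieldA and fieldB and a subtraction rule; the actual fields and rule aren't given above, and the 100 rows are generated only to show the calculation.

```typescript
interface LabelledRow {
  fieldA: number;
  fieldB: number;
  label: number;
}

// Fraction of rows whose label is reproduced by `rule`, within `tolerance`.
function derivableFraction<T>(
  rows: T[],
  label: (row: T) => number,
  rule: (row: T) => number,
  tolerance: number = 0,
): number {
  if (rows.length === 0) return 0;
  const hits = rows.filter((row) => Math.abs(label(row) - rule(row)) <= tolerance).length;
  return hits / rows.length;
}

// Synthetic rows: 99 of 100 labels are exactly fieldA - fieldB.
const rows: LabelledRow[] = [];
for (let i = 0; i < 100; i++) {
  const fieldA = 40 + i;
  const fieldB = 14;
  rows.push({ fieldA, fieldB, label: i === 0 ? 0 : fieldA - fieldB });
}

const fraction = derivableFraction(
  rows,
  (r) => r.label,
  (r) => r.fieldA - r.fieldB,
);

// Hypothetical cut-off: above it, a model would mostly be relearning arithmetic.
const worthTraining = fraction < 0.9;
```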

12. A catalogue of mistakes (and the rule each one left behind)

This is the section I most want to be straight about, because the tidy narrative above was assembled out of these mistakes. Nearly every rule in the system is a scar. Here are the ones that cost the most, and what each taught me.

I named the algorithm before I read it. Early on, the predictor was confidently described, by me, in a doc, as one well-known family of adaptive smoother. It was actually a different class of statistical filter entirely. Nobody caught it for a while because the name was plausible. Rule: read the code before you name the thing it implements. It's the first line of the 'gotchas' register now, and it generalises miles beyond algorithms.
I let the agent build from assumptions before reading the locked spec. The single most expensive recurring mistake. On the design side, the agent would generate mockups from section headers and reasonable-sounding assumptions, then have to bin them when the locked spec said otherwise — at one point inventing UI elements that simply weren't in the product. Rule: read the full locked spec and the actual shipped code first, list the real elements, and only then generate. This got hardened from a memory note into a preflight step in the design skill.
The 'fix' that opened a new silent failure. The optimistic-rollback that could leave the UI disagreeing with the database. Rule: optimistic-update rollback must reconcile against the source of truth, never a value captured before the write. Plus the meta-rule: always re-review a silent-failure fix for the new silent failure it's just introduced.
Eight pull requests merged with red CI. During one fast sprint, CI was red across eight PRs on the trot — a missing dev-dependency and some tooling noise. Local pre-commit checks kept the actual regressions out, but 'green locally, red in CI' is exactly the state where a real regression eventually slips by. Rule: two consecutive red-CI PRs trigger a mandatory infra investigation before the next merge.
I burnt half an hour on a known-broken test harness. Snapshot testing was incompatible with the specific major version of the mobile test preset we were on. Rule (and a circuit-breaker): if the same error survives two fix attempts, stop and write a five-line status — what I tried, what failed, my hypothesis, what I need, what I'd skip — instead of looping.
Shallow Object.freeze and a drifting dependency API. I froze a nested config object and assumed the whole tree was immutable (it wasn't: Object.freeze is shallow and only freezes the top level), and I called a device-storage API by the surface it had in a different version from the one installed. Both are the same error: assuming an API's behaviour instead of checking the installed version's actual surface. Both are one-line Known Pitfalls now.
Over-ceremony on trivial work. Running five personas, a critic and two reviewers over a one-line documentation change is pure waste. Rule: explicit exemptions, so doc-only and test-only changes skip the sub-agent reviewer (the suite still runs). Knowing when not to spend the ceremony is as much a skill as the ceremony itself.
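The shallow-freeze pitfall from the list above is quick to demonstrate. This sketch shows the behaviour and one common remedy, a recursive freeze; the Config shape and deepFreeze helper are illustrative, and the helper assumes an acyclic plain-object tree.

```typescript
interface Config {
  retry: { max: number };
}

// Object.freeze only freezes the object it is given, not the objects inside it.
const rawConfig: Config = { retry: { max: 3 } };
const shallow = Object.freeze(rawConfig);
shallow.retry.max = 99; // succeeds: the nested object was never frozen

// Recursively freeze a plain, acyclic object tree (children first, then the parent).
function deepFreeze<T>(value: T): T {
  if (typeof value !== "object" || value === null) return value;
  const record = value as unknown as Record<string, unknown>;
  for (const key of Object.keys(record)) {
    deepFreeze(record[key]);
  }
  Object.freeze(value);
  return value;
}

const deep = deepFreeze<Config>({ retry: { max: 3 } });
const nestedFrozenWhenShallow = Object.isFrozen(shallow.retry);
const nestedFrozenWhenDeep = Object.isFrozen(deep.retry);
```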

The pattern across all seven: every one became a rule, and most rules became mechanical guards. The system didn't get good by being designed cleverly up front. It got good by failing, cheaply and visibly, and then refusing to fail the same way twice. That's the real engine. Not the AI, but the discipline of turning each scar into a guard the AI then can't cross.

13. The decisions that aged well

For balance, because a retrospective that's all failures is as dishonest as one that's all wins, the calls that held up:

Foundation-first sequencing. Iterating the model and the privacy architecture before the UI meant the redesign was quick and clean when it finally arrived.
One ADR per decision, with conditions for overturning it. When the product posture shifted, the agent made the change surgically: it knew exactly which couplings to cut and which to leave alone, because the boundaries were written down. No re-explaining the architecture at midnight.
Pure-function extraction as the default testing strategy. Anything with real logic gets pulled into a pure, dependency-light module and tested there, leaving the framework-coupled shell thin. It's what makes a database-and-device-heavy codebase testable at all in a fast headless runner.
Privacy as an enforceable invariant, not a value statement. 'No personal field may be added to an analytics payload' is written into the kill-conditions and checkable on every diff. A constraint you can't test is one you'll eventually break.
Keeping the boring model. Both/and over either/or for the predictor. Conservatism as a feature.
Diagnostic-before-build. The cheapest sprint is the one a falsifiable diagnostic tells you not to run.
Making the agent improve its own process. The retro → register → plan → ship loop, with episodic lessons promoted into procedural guards, is the thing that compounds. The first sprints were clumsy; the later ones close a six-item register in a single session, because the process learnt.
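The privacy invariant in the list above is the kind of rule that can be checked mechanically. A minimal sketch, assuming a deny-list of personal field names and analytics payloads that are plain nested objects; the field names and payloads are hypothetical, and a real check would run against every diff rather than a hand-built sample.

```typescript
// Hypothetical deny-list; a real one would come from the ADR that governs privacy.
const PERSONAL_FIELDS: string[] = ["email", "name", "dateOfBirth"];

// Walk a payload and return the dotted path of every denied key found in it.
function findPersonalFields(payload: unknown, denied: string[], path: string = ""): string[] {
  if (typeof payload !== "object" || payload === null) return [];
  const record = payload as Record<string, unknown>;
  let found: string[] = [];
  for (const key of Object.keys(record)) {
    const here = path === "" ? key : `${path}.${key}`;
    if (denied.indexOf(key) !== -1) found.push(here);
    found = found.concat(findPersonalFields(record[key], denied, here));
  }
  return found;
}

const cleanPayload = { event: "insights_viewed", props: { screen: "insights" } };
const leakyPayload = { event: "insights_viewed", props: { screen: "insights", email: "a@example.com" } };

const cleanHits = findPersonalFields(cleanPayload, PERSONAL_FIELDS);
const leakyHits = findPersonalFields(leakyPayload, PERSONAL_FIELDS);
```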

If there's a thread through all of it, it's this: the good decisions are all about putting durable structure in place — artefacts, gates, invariants, pure modules — so that the agent's autonomy runs inside rails that get stronger over time.

14. What's still hard, honestly

It'd be dishonest to finish without the friction that hasn't gone away:

Context is the binding constraint. Long sessions hit context limits, and the checkpointing and self-orientation scaffolding exists precisely because the volatile window can't be trusted. Most of the engineering of 'autonomy' is really engineering around the memory boundary. Some of my longest, most productive sessions got cut off mid-task; the externalised checkpoint is what let the next session pick up cleanly, but the interruption is real and a bit maddening.
Ceremony has a cost. Even with the exemptions, the full machine is heavyweight. It pays for itself on multi-file, logic-bearing work and is overkill for trivia, and the calibration's never quite finished.
The agent races ahead on assumptions if you let it. That tendency isn't solved, only guarded. The default is to build from a plausible reading, and the discipline is a correction layered on top, not a change in the underlying behaviour.
Plausible-and-wrong is the characteristic failure. Not gibberish. Confident, well-argued, subtly off — the mis-named algorithm, the invented UI elements, the disproven 'sparse-logging gap', the rollback that re-introduced a silent failure. Every one was reasonable on its face. The whole adversarial apparatus exists to turn plausible-and-wrong into caught-and-corrected before it ships. You never stop needing it.

15. What actually transfers

If you want to run an agent like this, the transferable core is small:

Put your intent in durable artefacts — ADRs, specs, a live-state index, retros. The conversation is steering; the artefacts are the road.
Externalise the working memory so sessions survive context resets. Checkpoint, then orient.
Review adversarially, in fresh context. The reviewer mustn't share the implementer's blind spot. That's where the agent's own mistakes get caught.
Define 'done' as gates the agent can't fake — explicit counts, pre-registered thresholds, held-out benchmarks, privacy invariants wired into review. Never take a green compile as proof.
Promote the lessons up the memory stack. An episode that recurs becomes a rule; a rule that's structural becomes a mechanical guard. The process should get cleverer every sprint.
Keep the human on the 5% — design intent, product semantics, scope, anything irreversible. Surface those as structured choices with a recommendation, and let the agent own the rest.

The result isn't a robot that replaces engineers. It's a system — durable memory, an adversarial team, an eval stack and a human decision oracle — in which an AI can do the mechanical 95% of cross-stack engineering genuinely on its own, while the human's scarce attention goes where it's actually irreplaceable: taste, intent, and the calls that are theirs to make.

The codebase drives itself. You're not in the driver's seat any more. You're the one deciding where it's going, and you're the last line of defence on the brakes, watching a dashboard of gates the car is constitutionally unable to lie to you about. And every scar on that dashboard is somewhere the car already tried to crash, once, and now can't.

Autonomy without rails is just fast wrongness. Autonomy with rails that compound is a team.

Key takeaways

Put your intent in durable artefacts — ADRs, specs, a live-state index, retros. The conversation is steering; the artefacts are the road, and they're what let the agent rebuild context without you.
Externalise the working memory so sessions survive context resets: checkpoint, then orient. Most of the engineering of autonomy is really engineering around the memory boundary.
Review adversarially, in fresh context — the reviewer mustn't share the implementer's blind spot. It's the single highest-impact practice, and it catches the agent's own self-inflicted bugs.
Define 'done' as gates the agent can't fake: explicit test counts, pre-registered thresholds, held-out benchmarks, privacy invariants. Never take a green compile as proof.
Promote lessons up the memory stack — episode → rule → mechanical guard — and keep the human on the 5% that's genuinely theirs. The system compounds because each scar becomes a guard.

FAQs

Is it actually safe to leave an AI agent shipping code on its own?

It's safe to exactly the degree that 'done' is pinned to gates the agent can't fake — explicit test counts, strict type-checking, pre-registered thresholds, held-out benchmarks, privacy invariants — plus fresh-context adversarial review. The autonomy isn't trust in the model; it's trust in the brakes the model isn't allowed near.

Doesn't the agent just make things up?

The characteristic failure isn't gibberish, it's plausible-and-wrong: confident, well-argued, subtly off. A mis-named algorithm, invented UI elements, a rollback that re-introduces a silent failure. The whole apparatus — fresh-context reviewers, a critic that reads the code, pre-registered eval gates — exists to turn plausible-and-wrong into caught-and-corrected before it ships.

What's the single highest-impact practice here?

Fresh-context, adversarial review. When the reviewer sub-agent sees only the code and the rules — not the implementer's reasoning — it doesn't share the implementer's blind spot. On more than one occasion the silent-failure hunter caught a new silent failure that the fix itself had just introduced, which an agent reviewing its own work would never have spotted.

How do you stop the agent racing ahead on a wrong assumption?

Three things: durable artefacts so it reads settled decisions instead of guessing; the discipline of 'read the source before you name the fix' (the plan is a hypothesis, not an order); and surfacing genuine design and product decisions as structured human choices before coding starts. It's guarded, not solved — the default tendency to build from a plausible reading is strong.

Does this replace engineers?

No. It's a system in which an AI does the mechanical 95% of cross-stack engineering on its own, while the human's scarce attention goes where it's irreplaceable: taste, intent, scope, and anything irreversible. You're not in the driver's seat; you're deciding where it's going and watching a dashboard of gates the system can't lie to you about.


About the author: Venkata Kari · Founder, GVK Technologies

This is an honest write-up of a real, multi-month build run with an AI coding agent as a standing engineering team. Product, brand, vertical and dataset names are anonymised; the stack, patterns, failures and metrics are real, just scrubbed of anything identifying.
