Engineering Agents on Production Tasks
Engineering agents helped on production work only after we narrowed scope, encoded boundaries, and measured integration quality.
We gave engineering agents real production tasks on a regulated delivery programme and learned quickly where the boundary belongs.
The agents were strong at scaffolding services, producing first-pass tests, and following existing patterns. They were weaker when the task required unstated domain knowledge, inherited interface conventions, or judgement about risk.
The assignment
We started with bounded work: generate service skeletons from approved OpenAPI contracts, write mapping tests, propose validation cases, and update runbooks.
Human architects owned the domain model. Engineers owned merge authority. Agents operated inside the repository rules.
That division mattered.
What failed
The first failures looked plausible. One agent invented a validation rule with a convincing name. Another dropped a field that the contract marked optional but a legacy integration treated as mandatory. A third wrote tests that proved the generated code rather than the business rule.
The code looked clean. The intent was wrong.
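A minimal sketch of the third failure, in Python. Everything named here is hypothetical and chosen only for illustration: the `apply_discount` function, the 10% rate, and the 100.00 threshold are assumptions, not rules from the programme. The first check recomputes its expectation with the same logic as the code, so it passes even though the code is wrong. The second states the rule independently and fails.

```python
from decimal import Decimal

# Hypothetical business rule: orders of 100.00 or more get 10% off.
DISCOUNT_THRESHOLD = Decimal("100.00")
DISCOUNT_RATE = Decimal("0.10")


def apply_discount(total: Decimal) -> Decimal:
    """Generated implementation with a subtle bug: uses > where the rule says >=."""
    if total > DISCOUNT_THRESHOLD:
        return total * (1 - DISCOUNT_RATE)
    return total


def check_mirrors_implementation() -> bool:
    """Weak check: derives the expected value with the same logic as the code."""
    total = Decimal("100.00")
    expected = total * (1 - DISCOUNT_RATE) if total > DISCOUNT_THRESHOLD else total
    return apply_discount(total) == expected


def check_states_business_rule() -> bool:
    """Stronger check: the expected value is written down from the rule itself."""
    return apply_discount(Decimal("100.00")) == Decimal("90.00")
```

Only the second check can disagree with the implementation, which is what makes it evidence about intent.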
What worked
We added narrower specs, contract tests, and canary checks. Agent output had to pass the same policy and integration gates as human output.
Velocity improved on repeatable work. Review effort moved from syntax to intent. The platform became the control, not the agent.
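One way a contract test could catch the dropped-field failure is sketched below. The field names and the `LEGACY_CONSUMER_REQUIRED` set are assumptions made for the example. The idea is that the gate encodes what the consumer actually needs, and applies it to every payload regardless of who wrote the mapper.

```python
# Hypothetical example: the published schema marks "branch_code" as optional,
# but a legacy consumer rejects messages that omit it.
LEGACY_CONSUMER_REQUIRED = {"account_id", "amount", "branch_code"}


def missing_contract_fields(payload: dict, required: set[str]) -> set[str]:
    """Return the required fields that are absent or None in the payload."""
    return {name for name in required if payload.get(name) is None}


def passes_contract_gate(payload: dict) -> bool:
    """Apply the same gate to agent-written and human-written output."""
    return not missing_contract_fields(payload, LEGACY_CONSUMER_REQUIRED)
```

A payload without `branch_code` fails this gate even though it would validate against the schema alone.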
The lesson
Agents multiply the quality of the system around them. Strong specifications and strong tests turn them into useful throughput. Weak boundaries turn them into faster rework.
Rejected option
We rejected giving agents whole feature tickets. The tickets were clear to humans because they carried shared context, but the context was not explicit enough for an agent.
When agents received broad tickets, they filled gaps with plausible assumptions. Those assumptions were the risk.
What we changed
We rewrote tasks as bounded specifications. Each task named the files in scope, the contract to preserve, the tests to add, the forbidden changes, and the evidence required for review.
That made the agent less creative and more useful.
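The five parts of a bounded specification listed above could be captured in a small data structure, with a check that a change set stays inside it. This is a sketch under assumptions: the `TaskSpec` class, the glob-pattern convention, and the example paths are all hypothetical and do not describe the format we used.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical shape for a bounded agent task."""
    files_in_scope: tuple[str, ...]      # glob patterns the agent may touch
    contract: str                        # contract the change must preserve
    tests_to_add: tuple[str, ...]        # tests the agent must deliver
    forbidden_changes: tuple[str, ...]   # glob patterns that must not change
    evidence_required: tuple[str, ...]   # artefacts the reviewer expects


def scope_violations(spec: TaskSpec, changed_files: list[str]) -> list[str]:
    """List changed files that are forbidden or fall outside the declared scope."""
    violations = []
    for path in changed_files:
        if any(fnmatchcase(path, p) for p in spec.forbidden_changes):
            violations.append(f"forbidden: {path}")
        elif not any(fnmatchcase(path, p) for p in spec.files_in_scope):
            violations.append(f"out of scope: {path}")
    return violations


spec = TaskSpec(
    files_in_scope=("services/payments/*", "tests/payments/*"),
    contract="openapi/payments.yaml",
    tests_to_add=("tests/payments/test_mapping.py",),
    forbidden_changes=("db/migrations/*", "openapi/*"),
    evidence_required=("contract test report",),
)
```

With this spec, a change set that touches `openapi/payments.yaml` or `README.md` is reported before review starts, so the reviewer's attention goes to intent and not to boundary policing.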
We also separated author and reviewer roles. One agent could draft code, but a separate check, plus human review, had to inspect the result against policy and domain intent.
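The separation of roles can be expressed as a merge precondition. The sketch below is one possible reading, not our actual gate: the `Review` record and the `may_merge` function are hypothetical, and it assumes the independent check is any non-human reviewer other than the author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer: str
    is_human: bool
    approved: bool


def may_merge(author: str, reviews: list[Review]) -> bool:
    """Require an approving independent check and an approving human review.

    Approvals from the author itself are ignored, so no agent can
    approve its own draft.
    """
    approvals = [r for r in reviews if r.approved and r.reviewer != author]
    has_independent_check = any(not r.is_human for r in approvals)
    has_human_review = any(r.is_human for r in approvals)
    return has_independent_check and has_human_review
```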
Production lesson
Engineering agents are strongest when the organisation has already done the thinking.
They accelerate known patterns, expand tests, and carry mechanical work. They are weakest when asked to infer the shape of a regulated system from thin instructions.
The better the specification, the more valuable the agent.
The operating rule
The rule we kept was simple: the system should make the accountable path the default path.
That meant no hidden side channel, no manual exception that escaped the evidence record, and no output that could not be replayed later. If a reviewer changed the result, the change became part of the same record. If a threshold moved, the previous cases could be replayed before the change reached production.
This added a little ceremony. It removed a larger amount of ambiguity. Engineers knew what evidence the platform expected. Reviewers knew where to look. Operators knew which signal would trigger rollback.
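The replay step for a moved threshold can be sketched as follows. The scoring model, the `approve` and `refer` outcomes, and the `RecordedCase` shape are assumptions made for the example. The point is only that recorded cases, including any reviewer changes, are rerun against the proposed threshold before it ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedCase:
    """Hypothetical evidence record for one past decision."""
    case_id: str
    score: float
    final_decision: str  # "approve" or "refer", after any reviewer change


def decide(score: float, threshold: float) -> str:
    """Decision rule under a given threshold."""
    return "approve" if score >= threshold else "refer"


def replay(cases: list[RecordedCase], new_threshold: float) -> list[str]:
    """Return ids of recorded cases whose decision would differ under the new threshold."""
    return [
        c.case_id
        for c in cases
        if decide(c.score, new_threshold) != c.final_decision
    ]
```

An empty result means the change is behaviour-preserving on the recorded history. A non-empty result gives reviewers a concrete list of cases to inspect before approving the change.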
The result was calmer delivery. The team still moved quickly, but each step left a trail strong enough for someone else to inspect weeks later.
The practical value came from making the decision visible at the point where work changed hands. Engineers could see the boundary they were protecting. Reviewers could see the evidence they were accepting. Operators could see the rollback path before production pressure arrived. That shared view reduced the amount of trust the process had to borrow from memory.
That is why we kept the scope narrow. A narrow scope made the evidence stronger, the review simpler, and the next change easier to reason about. That was the point.
Bugni Labs
R&D Engine
The R&D engine powering our advanced software engineering practices: platform engineering, AI-native architectures, and AI-Native Engineering methodologies for enterprise clients.