Engineering Agents - Field Note

We gave engineering agents real production tasks on a regulated delivery programme and learned quickly where the boundary belongs.

The agents were strong at scaffolding services, producing first-pass tests, and following existing patterns. They were weaker when the task required unstated domain knowledge, inherited interface conventions, or judgement about risk.

The assignment

We started with bounded work: generate service skeletons from approved OpenAPI contracts, write mapping tests, propose validation cases, and update runbooks.

Human architects owned the domain model. Engineers owned merge authority. Agents operated inside the repository rules.

That division mattered.

What failed

The first failures all looked plausible. One agent invented a validation rule with a convincing name. Another dropped a field that the contract marked optional but that a legacy integration treated as mandatory. A third wrote tests that confirmed what the generated code did rather than what the business rule required.

The code looked clean. The intent was wrong.
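The third failure is easier to see in code. The sketch below is illustrative only: the rounding rule, the function names, and the fee example are assumptions made for this note, not code from the programme. It shows a test that mirrors the implementation passing while a test written from the rule fails.

```python
from decimal import Decimal, ROUND_HALF_UP


def generated_round_fee(amount: str) -> int:
    """Hypothetical stand-in for agent-generated code.

    Python's built-in round() rounds halves to the nearest even integer,
    which is not the same as the assumed business rule (halves round up).
    """
    return round(float(amount))


def mirror_test_passes() -> bool:
    # Weak test: the expected value is recomputed with the same call the
    # implementation uses, so it can only ever agree with the code.
    amount = "2.5"
    expected = round(float(amount))
    return generated_round_fee(amount) == expected


def rule_test_passes() -> bool:
    # Stronger test: the expected value is written by hand from the
    # (assumed) business rule: a fee of 2.5 rounds up to 3.
    return generated_round_fee("2.5") == 3


def corrected_round_fee(amount: str) -> int:
    """One possible fix that encodes the assumed rule explicitly."""
    return int(Decimal(amount).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
```

Here the mirroring test passes and the rule-based test fails against the same code. The rule-based test is the one that surfaces the defect.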

What worked

We added narrower specs, contract tests, and canary checks. Agent output had to pass the same policy and integration gates as human output.

Velocity improved on repeatable work. Review effort moved from syntax to intent. The platform became the control, not the agent.
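A minimal sketch of "the same gates for every author" follows. The gate names, the fields on `Change`, and the 1% canary threshold are all hypothetical. The only point is that the gate logic never reads who wrote the change.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Change:
    author_kind: str             # "agent" or "human"; recorded, never read by a gate
    contract_tests_passed: bool
    policy_violations: int
    canary_error_rate: float     # fraction of failed canary requests


Gate = Callable[[Change], bool]

# Hypothetical gate set; the threshold is an assumption for illustration.
GATES: dict[str, Gate] = {
    "contract_tests": lambda c: c.contract_tests_passed,
    "policy": lambda c: c.policy_violations == 0,
    "canary": lambda c: c.canary_error_rate <= 0.01,
}


def failed_gates(change: Change) -> list[str]:
    """Return the names of the gates this change fails, in gate order."""
    return [name for name, gate in GATES.items() if not gate(change)]
```

Two changes with identical evidence get identical results whether `author_kind` is "agent" or "human".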

The lesson

Agents multiply the quality of the system around them. Strong specifications and strong tests turn them into useful throughput. Weak boundaries turn them into faster rework.

Rejected option

We rejected giving agents whole feature tickets. The tickets were clear to humans because they carried shared context, but the context was not explicit enough for an agent.

When agents received broad tickets, they filled gaps with plausible assumptions. Those assumptions were the risk.

What we changed

We rewrote tasks as bounded specifications. Each task named the files in scope, the contract to preserve, the tests to add, the forbidden changes, and the evidence required for review.

That made the agent less creative and more useful.
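One way to make such a specification checkable is to hold it as data and compare it with the files a change actually touched. This is a sketch under assumptions: the field names follow the list above, but the `TaskSpec` type, the glob patterns, and the example paths are invented for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TaskSpec:
    files_in_scope: tuple[str, ...]      # glob patterns the agent may touch
    contract: str                        # contract that must be preserved
    tests_to_add: tuple[str, ...]        # test files the change must include
    forbidden_changes: tuple[str, ...]   # glob patterns that must stay untouched
    evidence_required: tuple[str, ...]   # artefacts the reviewer expects


def scope_violations(spec: TaskSpec, changed_files: list[str]) -> list[str]:
    """Compare the changed files with the spec and list every violation."""
    violations = []
    for path in changed_files:
        # Forbidden patterns win over in-scope patterns.
        if any(fnmatchcase(path, pat) for pat in spec.forbidden_changes):
            violations.append(f"forbidden: {path}")
        elif not any(fnmatchcase(path, pat) for pat in spec.files_in_scope):
            violations.append(f"out of scope: {path}")
    for test_path in spec.tests_to_add:
        if test_path not in changed_files:
            violations.append(f"missing test: {test_path}")
    return violations


# Hypothetical task: all paths below are made up for the example.
spec = TaskSpec(
    files_in_scope=("services/orders/*", "tests/orders/*"),
    contract="contracts/orders.yaml",
    tests_to_add=("tests/orders/test_mapping.py",),
    forbidden_changes=("contracts/*", "services/orders/migrations/*"),
    evidence_required=("contract test report",),
)
changed = [
    "services/orders/mapper.py",
    "contracts/orders.yaml",
    "services/billing/client.py",
]
problems = scope_violations(spec, changed)
```

With this input, `problems` flags the contract edit as forbidden, the billing file as out of scope, and the mapping test as missing. A check like this can run before any human looks at the diff.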

We also separated author and reviewer roles. One agent could draft code, but a second, independent check and a human reviewer both had to inspect the result against policy and domain intent.
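The separation can be expressed as a merge precondition. The sketch below assumes a simple `Review` record and three reviewer kinds ("agent", "automated", "human"). Those names are illustrative, not the programme's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    reviewer_id: str
    reviewer_kind: str   # assumed values: "agent", "automated", or "human"
    approved: bool


def may_merge(author_id: str, reviews: list[Review]) -> bool:
    """Require an independent non-human check and a human approval.

    Approvals from the author itself never count.
    """
    independent = [
        r for r in reviews if r.approved and r.reviewer_id != author_id
    ]
    has_human = any(r.reviewer_kind == "human" for r in independent)
    has_second_check = any(r.reviewer_kind != "human" for r in independent)
    return has_human and has_second_check
```

Under this rule an agent approving its own draft contributes nothing, and a human approval alone is not enough without the independent check.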

Production lesson

Engineering agents are strongest when the organisation has already done the thinking.

They accelerate known patterns, expand tests, and carry mechanical work. They are weakest when asked to infer the shape of a regulated system from thin instructions.

The better the specification, the more valuable the agent.

The operating rule

The rule we kept was simple: the system should make the accountable path the default path.

That meant no hidden side channel, no manual exception that escaped the evidence record, and no output that could not be replayed later. If a reviewer changed the result, the change became part of the same record. If a threshold moved, the previous cases could be replayed before the change reached production.
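The replay step can be sketched as follows, assuming decisions are driven by a score compared with a threshold. The record shape, the "accept"/"refer" labels, and the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedCase:
    case_id: str
    score: float
    decision: str   # decision actually issued: "accept" or "refer"


def decide(score: float, threshold: float) -> str:
    return "accept" if score >= threshold else "refer"


def record_override(log: list[dict], case_id: str, reviewer: str, decision: str) -> None:
    # A reviewer's change is appended to the same record, never applied out of band.
    log.append(
        {"event": "override", "case_id": case_id, "reviewer": reviewer, "decision": decision}
    )


def replay(cases: list[RecordedCase], new_threshold: float) -> list[str]:
    """Return the ids of recorded cases whose decision would change."""
    return [
        c.case_id for c in cases if decide(c.score, new_threshold) != c.decision
    ]


# Hypothetical history recorded under an earlier threshold of 0.70.
history = [
    RecordedCase("a", 0.90, "accept"),
    RecordedCase("b", 0.72, "accept"),
    RecordedCase("c", 0.40, "refer"),
]
changed = replay(history, new_threshold=0.75)
```

Replaying the history against a proposed threshold of 0.75 shows that only case "b" would flip, and that list can be reviewed before the change reaches production.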

This added a little ceremony. It removed a larger amount of ambiguity. Engineers knew what evidence the platform expected. Reviewers knew where to look. Operators knew which signal would trigger rollback.

The result was calmer delivery. The team still moved quickly, but each step left a trail strong enough for someone else to inspect weeks later.

The practical value came from making the decision visible at the point where work changed hands. Engineers could see the boundary they were protecting. Reviewers could see the evidence they were accepting. Operators could see the rollback path before production pressure arrived. That shared view reduced the amount of trust the process had to borrow from memory.

That is why we kept the scope narrow. A narrow scope made the evidence stronger, the review simpler, and the next change easier to reason about.
