Field Note · Engineering · Intermediate · 3 min read

AI Code Review in Regulated CI/CD

AI code review became useful only after we made it policy-aware, evidence-led, and subordinate to human ownership.

Bugni Labs

We tested AI code review inside a regulated CI/CD pipeline because human review was becoming the bottleneck on repeatable checks.

The goal was not to replace reviewers. The goal was to move obvious, policy-shaped, and pattern-based findings earlier in the pipeline.

The first version

The first version produced too much commentary. It spotted style issues, suggested refactors, and repeated points already covered by linters. Engineers ignored it because the signal was weak.

We changed the brief. The review agent could comment only on architectural boundaries, security-sensitive flows, missing tests, policy violations, and divergence from approved patterns.
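A minimal sketch of how the narrowed brief could be enforced after the reviewer responds. The five categories come from the brief above; the `Comment` shape, the string labels, and the `in_remit` helper are illustrative assumptions, not the pipeline's actual code.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five topics the narrowed brief allows."""
    ARCHITECTURAL_BOUNDARY = "architectural_boundary"
    SECURITY_SENSITIVE_FLOW = "security_sensitive_flow"
    MISSING_TEST = "missing_test"
    POLICY_VIOLATION = "policy_violation"
    PATTERN_DIVERGENCE = "pattern_divergence"


# Allow-list of category labels (hypothetical label strings).
ALLOWED = {c.value for c in Category}


@dataclass(frozen=True)
class Comment:
    category: str  # label assigned by the reviewer (assumed field)
    message: str


def in_remit(comments):
    """Keep only comments whose category is inside the brief."""
    return [c for c in comments if c.category in ALLOWED]


raw = [
    Comment("style", "Prefer single quotes."),
    Comment("missing_test", "No negative test for expired tokens."),
    Comment("refactor", "Consider extracting a helper."),
]
kept = in_remit(raw)
print([c.category for c in kept])
```

An allow-list fails closed: a new kind of commentary stays out of the pull request until someone adds it to the brief on purpose.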

The pipeline

Each pull request ran normal static checks first. The AI reviewer then received the diff, the owning domain, the relevant standards, and the test output.
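The ordering described above might look like the sketch below. The four fields mirror the inputs named in the text. Everything else is assumed for illustration: the `ReviewInput` name, the callables, the sample values, and in particular the choice to skip AI review when static checks fail.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ReviewInput:
    """What the AI reviewer receives for one pull request."""
    diff: str
    owning_domain: str
    standards: tuple
    test_output: str


def prepare_review(
    diff: str,
    owning_domain: str,
    standards: Sequence[str],
    run_static_checks: Callable[[str], bool],
    run_tests: Callable[[str], str],
) -> Optional[ReviewInput]:
    # Static checks run first. Returning None on failure is an assumption:
    # the text only says the static checks came before the AI reviewer.
    if not run_static_checks(diff):
        return None
    return ReviewInput(
        diff=diff,
        owning_domain=owning_domain,
        standards=tuple(standards),
        test_output=run_tests(diff),
    )


request = prepare_review(
    diff="--- a/payments/refund.py\n+++ b/payments/refund.py\n",
    owning_domain="payments",
    standards=["hypothetical-standard-01"],
    run_static_checks=lambda diff: True,      # stand-in for the linters
    run_tests=lambda diff: "12 passed, 0 failed",  # stand-in for the test run
)
print(request.owning_domain if request else "skipped")
```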

Findings had to include evidence: file path, line context, violated rule, and suggested verification. Anything speculative was dropped.
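One way to enforce the evidence rule is to reject any finding that arrives without all four fields. This is a sketch under assumptions: the field names follow the sentence above, while the dictionary input format, `parse_finding`, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    file_path: str
    line_context: str
    violated_rule: str
    suggested_verification: str


REQUIRED = ("file_path", "line_context", "violated_rule", "suggested_verification")


def parse_finding(raw: dict) -> Optional[Finding]:
    """Return a Finding only if every evidence field is a non-empty string."""
    values = {}
    for key in REQUIRED:
        value = raw.get(key)
        if not isinstance(value, str) or not value.strip():
            return None  # incomplete or speculative: drop it
        values[key] = value.strip()
    return Finding(**values)


complete = parse_finding({
    "file_path": "payments/refund.py",
    "line_context": "amount = request.amount  # no sign check",
    "violated_rule": "hypothetical-rule: refunds must validate amount",
    "suggested_verification": "Add a test that submits a negative amount.",
})
speculative = parse_finding({
    "file_path": "payments/refund.py",
    "violated_rule": "",  # the reviewer could not name a rule
})
print(complete is not None, speculative is None)
```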

Human reviewers kept merge authority.

What worked

The useful findings clustered around missing negative tests, accidental bypasses of policy helpers, unpinned model references, and unclear ownership of generated code.

The tool saved time only when it acted like a second reviewer with a narrow remit. When it tried to act like a general mentor, it became noise.

The lesson

AI code review belongs in regulated CI/CD when it is constrained by policy and evidence.

It should make human review sharper, not make accountability vague.

Rejected option

We rejected broad natural-language review comments. The early reviewer tried to be helpful everywhere, and that made it easy to ignore.

A regulated pipeline needs sharper output. If a finding cannot name the control, the risk, and the evidence, it should not block an engineer.
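The blocking rule can be expressed as a small predicate. In this sketch, `GateFinding`, `can_block`, and the sample control identifier are hypothetical; the only thing taken from the text is that a finding needs a named control, risk, and evidence before it may block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateFinding:
    summary: str
    control: str = ""   # the control the change is said to violate
    risk: str = ""      # what could go wrong
    evidence: str = ""  # where in the change this can be seen


def can_block(finding: GateFinding) -> bool:
    """A finding may block only if it names control, risk, and evidence."""
    return all(s.strip() for s in (finding.control, finding.risk, finding.evidence))


findings = [
    GateFinding(
        "Refund path skips the audit helper",
        control="hypothetical-control-AUD-3",
        risk="Refund issued without an audit trail",
        evidence="payments/refund.py, in process_refund",
    ),
    GateFinding("This module feels hard to follow"),
]
blocking = [f for f in findings if can_block(f)]
non_blocking = [f for f in findings if not can_block(f)]
print(len(blocking), len(non_blocking))
```

Even a blocking finding only raises a flag here. Merge authority stays with the human reviewer.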

What we tuned

We gave the reviewer a narrow checklist.

Does this change cross a domain boundary? Does it bypass an approved helper? Does it weaken audit evidence? Does it touch data classification? Does it change a prompt or model reference without tests? Does it need a rollback note?
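The checklist is small enough to live as data next to the pipeline configuration. In the sketch below the six questions are copied from the text, while the keys, the answer format, and `flagged_items` are assumptions made for illustration.

```python
# The six checklist questions, keyed by hypothetical identifiers.
CHECKLIST = {
    "domain_boundary": "Does this change cross a domain boundary?",
    "bypasses_helper": "Does it bypass an approved helper?",
    "weakens_audit": "Does it weaken audit evidence?",
    "data_classification": "Does it touch data classification?",
    "untested_model_change": "Does it change a prompt or model reference without tests?",
    "needs_rollback_note": "Does it need a rollback note?",
}


def flagged_items(answers: dict) -> list:
    """Return checklist keys answered True, in checklist order.

    Unknown keys raise, so the reviewer cannot invent new topics.
    """
    unknown = set(answers) - set(CHECKLIST)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    return [key for key in CHECKLIST if answers.get(key) is True]


answers = {
    "domain_boundary": False,
    "bypasses_helper": True,
    "needs_rollback_note": True,
}
for key in flagged_items(answers):
    print(CHECKLIST[key])
```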

That checklist made the output smaller and more trusted.

Production lesson

AI review is valuable when it behaves like a control, not a commentator.

It should catch what policy and pattern knowledge can catch early. It should leave architectural judgement and merge authority with people who own the system.

That made review faster without making responsibility blurry.

The operating rule

The rule we kept was simple: the system should make the accountable path the default path.

That meant no hidden side channel, no manual exception that escaped the evidence record, and no output that could not be replayed later. If a reviewer changed the result, the change became part of the same record. If a threshold moved, the previous cases could be replayed before the change reached production.
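A replay before a threshold change could be as simple as the sketch below. The text does not say what the threshold measures, so the numeric `score`, the "flag" and "pass" outcomes, and all names here are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedCase:
    case_id: str
    score: float           # hypothetical numeric signal stored with the case
    recorded_outcome: str  # "flag" or "pass", as decided at the time


def decide(score: float, threshold: float) -> str:
    """Illustrative decision rule: flag at or above the threshold."""
    return "flag" if score >= threshold else "pass"


def replay(cases, new_threshold: float) -> list:
    """Return ids of past cases whose outcome would change."""
    return [
        c.case_id
        for c in cases
        if decide(c.score, new_threshold) != c.recorded_outcome
    ]


# Outcomes below were recorded under an assumed old threshold of 0.5.
history = [
    RecordedCase("pr-1", 0.9, "flag"),
    RecordedCase("pr-2", 0.6, "flag"),
    RecordedCase("pr-3", 0.2, "pass"),
]
changed = replay(history, new_threshold=0.7)
print(changed)
```

The list of changed cases is what a reviewer would inspect before the new threshold reaches production.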

This added a little ceremony. It removed a larger amount of ambiguity. Engineers knew what evidence the platform expected. Reviewers knew where to look. Operators knew which signal would trigger rollback.

The result was calmer delivery. The team still moved quickly, but each step left a trail strong enough for someone else to inspect weeks later.

We also wrote the failure mode into the runbook. That small step mattered. When the next exception appeared, the team did not have to rediscover the reasoning. They could see the original decision, the rejected alternative, the signal to watch, and the rollback path. That is the level of memory regulated delivery needs.

The practical value came from making the decision visible at the point where work changed hands. Engineers could see the boundary they were protecting. Reviewers could see the evidence they were accepting. Operators could see the rollback path before production pressure arrived. That shared view reduced the amount of trust the process had to borrow from memory.
