AI Code Review in Regulated CI/CD
AI code review became useful only after we made it policy-aware, evidence-led, and subordinate to human ownership.
We tested AI code review inside a regulated CI/CD pipeline because human review was becoming the bottleneck for repeatable checks.
The goal was not to replace reviewers. The goal was to move obvious, policy-shaped, and pattern-based findings earlier in the pipeline.
The first version
The first version produced too much commentary. It spotted style issues, suggested refactors, and repeated points already covered by linters. Engineers ignored it because the signal was weak.
We changed the brief. The review agent could comment only on architectural boundaries, security-sensitive flows, missing tests, policy violations, and divergence from approved patterns.
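One way to enforce a brief like this is to filter the reviewer's output against an allow-list before anything reaches the pull request. The sketch below assumes each comment carries a category label. The `Comment` type, the category names, and `in_remit` are hypothetical names chosen for illustration, not taken from the pipeline itself.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five areas the narrowed brief allows (labels are illustrative)."""
    ARCHITECTURAL_BOUNDARY = "architectural_boundary"
    SECURITY_SENSITIVE_FLOW = "security_sensitive_flow"
    MISSING_TESTS = "missing_tests"
    POLICY_VIOLATION = "policy_violation"
    PATTERN_DIVERGENCE = "pattern_divergence"


@dataclass(frozen=True)
class Comment:
    category: str  # label emitted by the reviewer; may be out of scope
    message: str


ALLOWED = {c.value for c in Category}


def in_remit(comments):
    """Keep only comments whose category is inside the narrowed brief."""
    return [c for c in comments if c.category in ALLOWED]


comments = [
    Comment("style", "Rename this variable."),
    Comment("missing_tests", "No negative test for the rejected path."),
    Comment("refactor", "Consider extracting a helper."),
]
kept = in_remit(comments)
print([c.category for c in kept])
```

Style and refactor comments are discarded here rather than downgraded, which matches the intent of leaving those topics to linters and human reviewers.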
The pipeline
Each pull request ran normal static checks first. The AI reviewer then received the diff, the owning domain, the relevant standards, and the test output.
Findings had to include evidence: file path, line context, violated rule, and suggested verification. Anything speculative was dropped.
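The evidence rule lends itself to a mechanical check. The following sketch assumes a finding is a simple record with one field per required piece of evidence. The `Finding` type, its field names, and the sample values are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    message: str
    file_path: Optional[str] = None
    line_context: Optional[str] = None
    violated_rule: Optional[str] = None
    suggested_verification: Optional[str] = None


# The four pieces of evidence every finding must carry.
EVIDENCE_FIELDS = (
    "file_path",
    "line_context",
    "violated_rule",
    "suggested_verification",
)


def has_evidence(finding: Finding) -> bool:
    """True only if every evidence field is present and non-blank."""
    return all((getattr(finding, name) or "").strip() for name in EVIDENCE_FIELDS)


def drop_speculative(findings):
    """Discard any finding that is missing part of its evidence."""
    return [f for f in findings if has_evidence(f)]


findings = [
    Finding(
        message="Limit check bypasses the approved policy helper.",
        file_path="payments/limits.py",
        line_context="if amount > limit: ...",
        violated_rule="approved-helper-usage",
        suggested_verification="Add a negative test for an over-limit payment.",
    ),
    Finding(message="This module might have performance problems."),
]
kept = drop_speculative(findings)
print(len(kept))
```

A finding with a message but no file path, rule, or verification step never reaches the pull request in this sketch.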
Human reviewers kept merge authority.
What worked
The useful findings clustered around missing negative tests, accidental bypasses of policy helpers, unpinned model references, and unclear ownership of generated code.
The tool saved time only when it acted like a second reviewer with a narrow remit. When it tried to act like a general mentor, it became noise.
The lesson
AI code review belongs in regulated CI/CD when it is constrained by policy and evidence.
It should make human review sharper, not make accountability vague.
Rejected option
We rejected broad natural-language review comments. The early reviewer tried to be helpful everywhere, and that made it easy to ignore.
A regulated pipeline needs sharper output. If a finding cannot name the control, the risk, and the evidence, it should not block an engineer.
What we tuned
We gave the reviewer a narrow checklist.
Does this change cross a domain boundary? Does it bypass an approved helper? Does it weaken audit evidence? Does it touch data classification? Does it change a prompt or model reference without tests? Does it need a rollback note?
That checklist made the output smaller and more trusted.
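A checklist like this can be held as data, so the questions are versioned alongside the code they govern. The sketch below is one possible shape: the keys and the `flagged` helper are hypothetical, and treating an unanswered question as flagged is an assumption rather than something the pipeline is known to do.

```python
from typing import Dict, List

# The six checklist questions, keyed by illustrative identifiers.
CHECKLIST = (
    ("crosses_domain_boundary", "Does this change cross a domain boundary?"),
    ("bypasses_approved_helper", "Does it bypass an approved helper?"),
    ("weakens_audit_evidence", "Does it weaken audit evidence?"),
    ("touches_data_classification", "Does it touch data classification?"),
    ("untested_prompt_or_model_change",
     "Does it change a prompt or model reference without tests?"),
    ("needs_rollback_note", "Does it need a rollback note?"),
)


def flagged(answers: Dict[str, bool]) -> List[str]:
    """Return the questions answered 'yes'.

    Assumption: a question with no recorded answer counts as flagged,
    so a gap is surfaced instead of passing silently.
    """
    return [question for key, question in CHECKLIST if answers.get(key, True)]


answers = {key: False for key, _ in CHECKLIST}
answers["bypasses_approved_helper"] = True
print(flagged(answers))
```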
Production lesson
AI review is valuable when it behaves like a control, not a commentator.
It should catch what policy and pattern knowledge can catch early. It should leave architectural judgement and merge authority with people who own the system.
That made review faster without making responsibility blurry.
The operating rule
The rule we kept was simple: the system should make the accountable path the default path.
That meant no hidden side channel, no manual exception that escaped the evidence record, and no output that could not be replayed later. If a reviewer changed the result, the change became part of the same record. If a threshold moved, the previous cases could be replayed before the change reached production.
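Replaying previous cases before a threshold change can be sketched as a comparison of outcomes under the old and new values. The text does not say what the threshold measures, so the numeric `score`, the `decide` rule, and the case identifiers below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedCase:
    case_id: str
    score: float  # hypothetical: a score stored in the evidence record


def decide(score: float, threshold: float) -> str:
    """Illustrative decision rule: flag at or above the threshold."""
    return "flag" if score >= threshold else "pass"


def replay(cases, old_threshold: float, new_threshold: float):
    """Return the cases whose outcome would change under the new threshold."""
    changed = []
    for case in cases:
        before = decide(case.score, old_threshold)
        after = decide(case.score, new_threshold)
        if before != after:
            changed.append((case.case_id, before, after))
    return changed


history = [
    RecordedCase("case-a", 0.55),
    RecordedCase("case-b", 0.72),
    RecordedCase("case-c", 0.91),
]
changed = replay(history, old_threshold=0.7, new_threshold=0.8)
print(changed)
```

Here one recorded case flips from flagged to passed, which is the kind of difference that would need sign-off before the new threshold ships.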
This added a little ceremony. It removed a larger amount of ambiguity. Engineers knew what evidence the platform expected. Reviewers knew where to look. Operators knew which signal would trigger rollback.
The result was calmer delivery. The team still moved quickly, but each step left a trail strong enough for someone else to inspect weeks later.
We also wrote the failure mode into the runbook. That small step mattered. When the next exception appeared, the team did not have to rediscover the reasoning. They could see the original decision, the rejected alternative, the signal to watch, and the rollback path. That is the level of memory regulated delivery needs.
The practical value came from making the decision visible at the point where work changed hands. Engineers could see the boundary they were protecting. Reviewers could see the evidence they were accepting. Operators could see the rollback path before production pressure arrived. That shared view reduced the amount of trust the process had to borrow from memory.
Bugni Labs
R&D Engine
The R&D engine powering our advanced software engineering practices: platform engineering, AI-native architectures, and AI-Native Engineering methodologies for enterprise clients.