Building Governed AI Delivery Pipelines: How We Structured Ours for Regulated Financial Services

We build software for banks. Not the kind of banks that move fast and break things - the kind where a misclassified transaction can trigger a regulatory investigation, and where every deployment decision sits inside a chain of accountability that stretches from engineering to the board.

When we started embedding AI components into delivery pipelines for regulated clients, the question was never whether we needed governance. It was how to make governance fast enough that it did not become an excuse to avoid shipping.

This is how we structured a governed AI delivery pipeline for a UK retail bank. It took three iterations to get right. The first version was too slow. The second had a blind spot we did not catch until production. The third is what we run today.

The starting constraint

The bank's model risk management framework required every AI-assisted component to pass through three gates before it could touch production data:

  1. A validation gate confirming the component behaves within documented parameters.
  2. A policy-as-code gate confirming it complies with the bank's data handling, fairness, and explainability requirements.
  3. A human review checkpoint where a named individual signs off on deployment.

These gates already existed for traditional ML models - credit scoring, fraud detection - but they assumed quarterly release cadences. We were shipping weekly. Sometimes daily. The existing gates added fourteen days to every deployment.

Fourteen days is not governance. It is a queue.

What we built

The pipeline has four stages. We run it on GitHub Actions with self-hosted runners inside the bank's private cloud. We evaluated Azure DevOps (the bank's default) and rejected it - the conditional job syntax made policy-as-code checks harder to express, and the approval gates did not support programmatic overrides for low-risk changes.

Stage 1: Build and unit validation

Standard. Container builds, unit tests, contract tests against downstream services. Nothing AI-specific here. Median time: 4 minutes.

Stage 2: Behavioural validation

This is where it gets interesting. Every AI component ships with a validation suite - a set of input/output pairs that define expected behaviour within tolerance bands. Not accuracy benchmarks. Behavioural contracts.

The validation suite for our evidence extraction service looks like this:

{
  "suite": "evidence-extraction-v3",
  "model": "gemini-2.5-pro-preview-05-06",
  "assertions": [
    {
      "id": "entity-extraction-precision",
      "input_hash": "a8f3c2d1",
      "expected_entities": ["transaction_id", "account_holder", "beneficiary", "amount", "date"],
      "min_precision": 0.92,
      "min_recall": 0.88
    },
    {
      "id": "confidence-calibration",
      "description": "High-confidence outputs should be correct >95% of the time",
      "threshold": 0.85,
      "expected_accuracy_above_threshold": 0.95
    },
    {
      "id": "hallucination-guard",
      "description": "No entity should appear in output that cannot be traced to input span",
      "max_ungrounded_rate": 0.02
    }
  ]
}

We run the full suite against a frozen evaluation dataset. Not production data - a curated set of 340 documents that cover the known edge cases. The suite takes about seven minutes. If any assertion fails, the pipeline stops. No exceptions, no overrides.
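
The runner itself is thin. A minimal sketch of its shape in Python - the suite schema is the real one above, but `run_suite`, `check_assertion`, and the measured-metrics dict are illustrative; the real scorer replays the 340-document set rather than taking numbers as input:

import json
import sys

def check_assertion(assertion: dict, measured: dict) -> list[str]:
    # Compare measured metrics against the min_*/max_* tolerance bands.
    # List-valued checks (expected_entities) and confidence calibration
    # need their own scorers; this handles scalar thresholds only.
    failures = []
    for key, target in assertion.items():
        if key.startswith("min_") and measured.get(key[4:], 0.0) < target:
            failures.append(f"{assertion['id']}: {key[4:]} below {target}")
        if key.startswith("max_") and measured.get(key[4:], 1.0) > target:
            failures.append(f"{assertion['id']}: {key[4:]} above {target}")
    return failures

def run_suite(suite_path: str, measured_by_id: dict) -> int:
    with open(suite_path) as f:
        suite = json.load(f)
    failures = []
    for assertion in suite["assertions"]:
        failures += check_assertion(assertion, measured_by_id.get(assertion["id"], {}))
    for msg in failures:
        print(f"FAIL {msg}", file=sys.stderr)
    return 1 if failures else 0  # non-zero exit stops the pipeline, no overrides

if __name__ == "__main__":
    # Illustrative numbers; real values come from the frozen evaluation set.
    measured = {
        "entity-extraction-precision": {"precision": 0.94, "recall": 0.90},
        "hallucination-guard": {"ungrounded_rate": 0.01},
    }
    sys.exit(run_suite("evidence-extraction-v3.json", measured))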

We tried making the thresholds configurable per-environment (looser in staging, strict in production). That was a mistake. Engineers started treating staging failures as noise. Within two weeks, a component that consistently failed the hallucination guard in staging made it to the production approval queue. We caught it at the human review gate, but it should never have got that far.

Now the thresholds are identical everywhere. A failure in staging is a failure.

Stage 3: Policy-as-code

We use Open Policy Agent. The policies are stored in the same repository as the application code, versioned together, reviewed together. Separating policy from code was something we considered and rejected - when policies live in a central repository managed by a different team, drift is inevitable.

The policies cover four domains:

Data residency. Every AI component declares which data classifications it processes. The policy checks that the deployment target matches the data residency requirements. A component processing PII cannot deploy to a region outside the bank's approved list.

Explainability. Every AI output must include a provenance chain - the input spans that contributed to each output field. The policy checks that the output schema includes provenance metadata; a sketch of the expected shape follows the four domains below. It does not validate the quality of the explanations; that happens in behavioural validation.

Fairness. For components that make or influence decisions about customers, the policy checks that the validation suite includes disaggregated performance metrics across protected characteristics. If the suite does not include these checks, the deployment is blocked.

Model versioning. Every component pins its model version explicitly. No floating references, no "latest". The policy rejects any configuration that references a model version not in the approved registry.
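
The provenance requirement in the explainability domain is worth making concrete. A sketch of the output shape the policy looks for - the field names here are illustrative, not the component's real schema:

from typing import TypedDict

class Provenance(TypedDict):
    source_document_id: str
    span_start: int   # character offset into the input document
    span_end: int

class ExtractedField(TypedDict):
    name: str                      # e.g. "transaction_id"
    value: str
    confidence: float
    provenance: list[Provenance]   # presence of this key is what the policy checks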

A typical policy check:

package ai.deployment

import future.keywords.in

# No floating references: every deployment must pin an explicit model version.
deny[msg] {
    input.model_config.version == "latest"
    msg := "Model version must be pinned. 'latest' is not permitted."
}

# Data residency: PII components may only deploy to approved regions.
deny[msg] {
    input.data_classification == "pii"
    not input.deployment_region in data.approved_regions
    msg := sprintf("PII component cannot deploy to %s", [input.deployment_region])
}

# Fairness: decision-support components must ship disaggregated metrics.
deny[msg] {
    input.component_type == "decision_support"
    not input.validation_suite.disaggregated_metrics
    msg := "Decision-support components require disaggregated fairness metrics."
}

Policy evaluation takes under ten seconds. The entire stage, including fetching the approved model registry from the bank's internal API, takes about thirty seconds.
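
In the pipeline this is a single `opa eval` call against the deployment manifest. A sketch of the gate step, assuming the policies live in policies/ and the manifest has been serialised to deployment.json (both paths illustrative), and assuming opa eval's standard JSON output shape:

import json
import subprocess
import sys

def policy_gate(manifest_path: str = "deployment.json") -> None:
    # Query the deny set; any message at all blocks the deployment.
    completed = subprocess.run(
        ["opa", "eval", "--format", "json",
         "--data", "policies/", "--input", manifest_path,
         "data.ai.deployment.deny"],
        capture_output=True, text=True, check=True,
    )
    denials = json.loads(completed.stdout)["result"][0]["expressions"][0]["value"]
    if denials:
        for msg in denials:
            print(f"POLICY DENY: {msg}", file=sys.stderr)
        sys.exit(1)

if __name__ == "__main__":
    policy_gate()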

Stage 4: Human review

A named reviewer from the bank's AI governance team approves the deployment. We pushed hard to make this asynchronous and fast. The reviewer sees a deployment summary - not raw logs, but a structured report:

Component:           evidence-extraction-v3.2.1
Model:               gemini-2.5-pro-preview-05-06
Validation:          PASSED (all 12 assertions, 7m14s)
Policy:              PASSED (4/4 domains)
Data classification: CONFIDENTIAL
Deployment target:   uk-south-prod
Change summary:      Prompt revision for entity disambiguation.
                     No schema changes. No model version change.
Risk assessment:     LOW (prompt-only change, model pinned)

Low-risk changes - prompt revisions, threshold adjustments, configuration changes with no model version change - get reviewed within two hours. The reviewer has a Slack integration that pings them with the summary and a one-click approve button.

High-risk changes - model version upgrades, schema changes, new data classification scopes - require a synchronous review with the engineering team. These happen in a scheduled slot, twice per week.

The three failure modes

We discovered these in the first month. All three were things we thought we had covered.

Failure mode 1: Silent model degradation

We pinned model versions. We ran validation suites. But we did not account for provider-side changes that happen without a version bump. In one case, a model provider updated their safety filters, which changed the output distribution for a subset of our evaluation documents. Precision on entity extraction dropped from 0.94 to 0.87. The validation suite caught it - but only because we run it on every deployment. If we had been running it on a schedule (weekly, say), the degradation would have been live for days.

The fix: we now run the validation suite on a daily cron, even when nothing is being deployed. If the suite fails, we get an alert and automatically roll the component back to the last known-good configuration. This has fired twice in six months.
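
The job itself is thin - it reuses the stage-2 runner and only adds the alert-and-rollback decision. A sketch, where `alert`, `rollback_to`, and the CLI names stand in for the bank's paging and deployment tooling:

import subprocess
import sys

def alert(message: str) -> None:
    # Stand-in for the paging integration.
    print(f"ALERT: {message}", file=sys.stderr)

def rollback_to(component: str, version: str) -> None:
    # Stand-in for redeploying the last known-good configuration.
    print(f"ROLLBACK: {component} -> {version}")

def daily_validation(component: str, last_known_good: str) -> None:
    # Re-run the frozen validation suite against the live configuration.
    result = subprocess.run(["validate", "--suite", f"{component}.json"])
    if result.returncode != 0:
        alert(f"{component}: daily validation failed")
        rollback_to(component, last_known_good)

if __name__ == "__main__":
    daily_validation("evidence-extraction-v3", "v3.2.0")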

Failure mode 2: Policy drift on approved model registry

The bank's approved model registry is maintained by a separate team. When a model version is deprecated, they remove it from the registry. Our policy-as-code check validates against the registry at deployment time. But it does not validate running components. We had a production component running a model version that had been removed from the approved registry three weeks earlier.

The fix: a daily reconciliation job that checks every running component's model version against the current registry. Unapproved versions trigger an alert and a 48-hour remediation window.
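
A sketch of the reconciliation loop - the registry URL, the response shape, and the component inventory are all illustrative:

import json
import urllib.request
from datetime import datetime, timedelta, timezone

REGISTRY_URL = "https://models.internal.example/approved"  # illustrative endpoint

def reconcile(running: dict) -> list:
    # `running` maps component name -> pinned model version.
    with urllib.request.urlopen(REGISTRY_URL) as resp:
        approved = set(json.load(resp)["versions"])  # assumed response shape
    stale = [name for name, version in running.items() if version not in approved]
    deadline = datetime.now(timezone.utc) + timedelta(hours=48)
    for name in stale:
        print(f"ALERT: {name} pins an unapproved model; "
              f"remediate by {deadline:%Y-%m-%d %H:%M} UTC")
    return stale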

Failure mode 3: Human review bottleneck

The two-hour SLA for low-risk reviews worked for the first three weeks. Then the volume of deployments increased, and the single reviewer became a bottleneck. Average review time crept to six hours, then to next-day.

The fix was not adding more reviewers. It was reducing the number of changes that required human review. We introduced a "no-review" path for changes that met all of the following: prompt-only change, no model version change, validation suite pass rate identical to previous deployment, policy check pass. These changes deploy automatically with a 24-hour canary window. If any monitoring threshold breaches during the canary, the deployment rolls back and a human review is triggered retroactively.
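
The eligibility test is deliberately simple - four boolean conditions, all required. A sketch, with the Change fields standing in for whatever metadata your deployment records actually carry:

from dataclasses import dataclass

@dataclass
class Change:
    prompt_only: bool            # no code, schema, or scope changes
    model_version_changed: bool
    suite_pass_rate: float       # this deployment's validation pass rate
    previous_pass_rate: float    # last deployed version's pass rate
    policy_passed: bool

def no_review_eligible(c: Change) -> bool:
    # Anything that fails any condition goes to a human reviewer.
    return (
        c.prompt_only
        and not c.model_version_changed
        and c.suite_pass_rate == c.previous_pass_rate
        and c.policy_passed
    )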

About 60% of our deployments now take the no-review path. Median time from commit to production for these changes: 22 minutes. For reviewed changes: 3.5 hours.

What we would do differently

We over-invested in the human review stage and under-invested in automated canary analysis. The canary window was an afterthought - bolted on when the review bottleneck became painful. If we were starting again, we would design the canary as the primary safety mechanism and treat human review as an escalation path, not a default gate.

We also spent too long evaluating OPA alternatives. We looked at Cedar (AWS's policy language), Kyverno (Kubernetes-native), and a custom DSL the bank's platform team had built. OPA won because it had the broadest adoption, the most documentation, and - critically - the bank's security team already understood it. The best policy engine is the one your reviewers can read.

Governance does not have to be slow. But it has to be structural. Bolting governance onto an existing pipeline as an afterthought produces theatre - checkboxes that create the appearance of control without the substance. Building it into the pipeline from the start, with the same engineering rigour you would apply to any other system component, produces something that actually works.

Pin your model versions. Validate daily, not just at deployment. Automate everything that can be automated, and make the human review step fast, focused, and well-informed.
