GenAI's trough is a governance gap

I watch financial services leaders move through the same pattern. First came the pressure to adopt generative AI quickly. Then came the pilots: copilots, document assistants, service chat, compliance summaries, code generation experiments. Now comes the harder question: why did so little of it change the operating model?

The GenAI trough of disillusionment is not evidence that the technology failed. It is evidence that most institutions tried to attach non-deterministic intelligence to software lifecycles that were never designed to contain it.

The next phase belongs to institutions that stop treating AI as a feature and start treating it as an engineering constraint. In regulated finance, AI becomes useful at scale only when it is surrounded by domain boundaries, runtime governance, evidence capture, and reversible change. The model is not the product. The governed system around it is.

Pilot fatigue is a governance signal

Pilot fatigue is real. Executives are tired of sandbox demos that cannot cross the line into production. Engineering teams are tired of prototypes that need a special data path, a manual review queue, and a bespoke exception process. Compliance teams are tired of being asked to approve systems that cannot explain themselves.

The problem is not that early use cases lacked imagination. Many were sensible: drafting customer-service responses, summarising complaints, triaging financial crime alerts, generating test cases, helping engineers understand old code, and producing regulatory narratives. The problem was that they sat outside the operating fabric of the institution.

A pilot can succeed with curated data, friendly users, and manual oversight. Production is different. Production needs identity, access control, observability, auditability, exception handling, rollback, versioning, data lineage, and operational resilience. Without those controls, even a useful model becomes a risk surface.

That is why the trough should be read as a governance signal. The industry has learned that capability without containment does not scale.

The mistake was model-first thinking

The first wave of generative AI adoption was model-first. Teams asked which model to use, which vendor to trust, which assistant to deploy, or which prompt pattern worked best. Those questions were understandable, but they put the abstraction in the wrong place.

A financial institution does not ship a model. It ships a decision, an action, a document, a code change, or an operational response. Each of those artefacts has a policy context, an evidence trail, and a failure mode. The system must govern the artefact, not admire the model that produced it.

Model-first thinking also created vendor dependency. Banks paid for licences, wrapped APIs, and built one-off integrations that were hard to replace. When the provider changed the model, the reasoning pattern, the price, or the product roadmap, the bank absorbed the risk.

System-first thinking is different. It asks what the AI is allowed to do, what evidence it must use, which domain it can act inside, how outputs are validated, and how the institution reverses or blocks an action. The model can then improve or change without destabilising the workflow.

Governance has to move into runtime

Traditional governance is too slow for agentic systems. A monthly model-risk committee cannot review every generated code change, screening decision, or policy interpretation. Human oversight remains essential, but it has to be placed where it creates control.

The practical shift is from human-in-the-loop everywhere to human authority over the loop. Architects, risk owners, and compliance specialists define the constraints. Runtime systems enforce them. Exceptions route to humans when a threshold is crossed, evidence is incomplete, or the action would affect a customer, ledger, or regulatory position.

In a screening workflow, this might mean that the system can auto-clear low-risk false positives when evidence is complete and policy permits it. It must escalate ambiguous matches, retain the evidence bundle, and record the policy version in force. In a credit workflow, it might mean that the model can help interpret affordability evidence, but cannot alter eligibility thresholds without a governed release path.
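The screening example can be sketched as a deterministic gate that sits between the model and the action. This is a minimal illustration under stated assumptions, not a reference design: the names `Alert`, `Policy`, `Decision`, and `gate`, the `risk_score` field, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    AUTO_CLEAR = "auto_clear"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class Alert:
    alert_id: str
    risk_score: float        # hypothetical model score in [0, 1]
    evidence_complete: bool  # True when the full evidence bundle is attached


@dataclass(frozen=True)
class Policy:
    version: str
    auto_clear_enabled: bool
    max_auto_clear_score: float


@dataclass(frozen=True)
class Decision:
    alert_id: str
    outcome: Outcome
    policy_version: str  # the policy version in force when the decision was made
    reasons: tuple       # why the alert was escalated; empty when auto-cleared


def gate(alert: Alert, policy: Policy) -> Decision:
    """Auto-clear only when every condition holds; otherwise escalate to a human."""
    reasons = []
    if not policy.auto_clear_enabled:
        reasons.append("policy does not permit auto-clear")
    if not alert.evidence_complete:
        reasons.append("evidence incomplete")
    if alert.risk_score > policy.max_auto_clear_score:
        reasons.append("risk score above auto-clear threshold")
    outcome = Outcome.ESCALATE if reasons else Outcome.AUTO_CLEAR
    return Decision(alert.alert_id, outcome, policy.version, tuple(reasons))
```

The gate does not ask the model whether it is allowed to act. The policy object decides, and every decision carries the policy version and the escalation reasons so that the path can be reconstructed later.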

This is where domain-driven boundaries matter. An agent operating in complaints handling should not be able to change payment routing. A code-generation agent working on a servicing module should not modify ledger logic. A regulatory narrative agent should not invent evidence that the source systems cannot prove.
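One way to picture a domain boundary is as an explicit allow-list attached to the agent's identity and checked before any tool call. The sketch below is illustrative only: `AgentIdentity`, `authorise`, `AuthorityError`, and the `domain:action` naming scheme are assumptions, not an established API.

```python
from dataclasses import dataclass


class AuthorityError(PermissionError):
    """Raised when an agent attempts an action outside its domain."""


@dataclass(frozen=True)
class AgentIdentity:
    name: str
    allowed_actions: frozenset  # entries are hypothetical "domain:action" strings


def authorise(agent: AgentIdentity, domain: str, action: str) -> None:
    """Deny by default: anything not explicitly granted is refused."""
    key = f"{domain}:{action}"
    if key not in agent.allowed_actions:
        raise AuthorityError(f"{agent.name} may not perform {key}")


# A complaints agent can read cases and draft responses, and nothing else.
complaints_agent = AgentIdentity(
    name="complaints-triage-agent",
    allowed_actions=frozenset({"complaints:read_case", "complaints:draft_response"}),
)
```

With this shape, `authorise(complaints_agent, "complaints", "draft_response")` returns quietly, while `authorise(complaints_agent, "payments", "change_routing")` raises `AuthorityError`. The point of the design is that the boundary is enforced by the platform, not requested of the model in a prompt.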

Governance by audit catches problems late. Governance by design prevents many of them from reaching production.

Agentic systems need an operating platform

The industry is moving from basic chat interfaces to orchestrated workflows. That shift is unavoidable. The productivity gain does not come from asking a model a question. It comes from letting agents plan, retrieve, generate, test, validate, and route work inside a controlled platform.

But agentic systems multiply risk when the platform is weak. An agent that can call tools, write code, query data, and trigger actions needs tighter constraints than a passive assistant. It needs identity, scoped authority, state management, audit evidence, and deterministic gates.

In practice, the strongest financial-services implementations use AI inside an engineering lifecycle that is already structured. Intent is clarified before work starts. Specifications are explicit. Generated artefacts are tested. Validation is automated where possible. Operations capture evidence. Evolution happens through versioned change rather than ad hoc patching.

This kind of discipline can compress delivery timelines. In a digital challenger bank context, teams have delivered 20 microservices for a credit decisioning platform in four months because generation happened inside defined domains and passed through review gates. The agents accelerated implementation, but the architecture controlled the work.

The counter-argument: pilots were necessary

There is a fair defence of the first wave. Pilots taught financial institutions what the technology could and could not do. They helped legal, risk, engineering, and operations teams build a shared vocabulary. They exposed data-access problems that had been ignored for years. In that sense, the demos were not wasted.

But a pilot is useful only if it changes the next investment decision. Too many programmes treated proof of possibility as proof of readiness. They showed that a model could draft a response, summarise a case, or generate code. They did not prove that the institution could govern that output inside a production system.

The next phase should keep the learning and drop the theatre. A pilot should be judged by whether it identifies the platform capabilities needed for production: data contracts, test harnesses, runtime gates, review routes, incident handling, and rollback. If it cannot name those requirements, it is entertainment.

Platform engineering decides who escapes the trough

AI-native methods cannot sit on fragile infrastructure. Financial services workloads need elastic capacity, predictable recovery, data isolation, and audit-grade observability. Event-driven architecture is especially important because it lets teams decouple reasoning from action.

A screening platform, for example, should not be hard-wired to one provider or one model. Events should carry the work. Providers should be routed through clear contracts. Evidence should be captured as part of the flow. If a provider fails or a model changes, the system should queue, reroute, or fall back without compromising the core process.
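The routing behaviour described here can be sketched in a few lines. This is a simplified, in-memory illustration under assumptions: `ProviderRouter`, the event shape (a dict with an `id` key), and the evidence record fields are hypothetical, and a production system would use a durable queue and an event broker instead of a `deque` and a list.

```python
from collections import deque
from typing import Callable, Optional

# A provider is any callable that takes an event and returns a result.
Provider = Callable[[dict], dict]


class ProviderRouter:
    def __init__(self, providers: list):
        self.providers = providers  # ordered list of (name, provider) pairs
        self.pending = deque()      # events waiting for a provider to recover
        self.evidence = []          # one record per attempt, success or failure

    def handle(self, event: dict) -> Optional[dict]:
        """Try each provider in order; queue the event if all of them fail."""
        for name, provider in self.providers:
            try:
                result = provider(event)
            except Exception as exc:
                self.evidence.append(
                    {"event_id": event["id"], "provider": name,
                     "status": "failed", "error": str(exc)}
                )
                continue  # reroute to the next provider
            self.evidence.append(
                {"event_id": event["id"], "provider": name, "status": "ok"}
            )
            return result
        self.pending.append(event)  # no provider available: hold the work
        return None
```

Because callers depend only on the provider contract, swapping or reordering providers is a configuration change, and the evidence list records which provider handled each event.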

This is how vendor-agnostic orchestration produces financial value. Institutions can reduce total cost of ownership by avoiding rigid licence stacks and unused capacity. They can change providers without rebuilding core systems. They can test a new reasoning engine in shadow mode before it touches production.
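Shadow mode, in its simplest form, means running a candidate engine on the same event as the production engine, recording whether they agree, and only ever acting on the production result. The function below is a rough sketch with hypothetical names (`run_with_shadow`, `comparisons`), assuming both engines are plain callables over a dict event with an `id` key.

```python
from typing import Callable


def run_with_shadow(
    event: dict,
    production: Callable[[dict], dict],
    candidate: Callable[[dict], dict],
    comparisons: list,
) -> dict:
    """Return the production result; record how the candidate compared."""
    live = production(event)
    try:
        shadow = candidate(event)
        comparisons.append(
            {"event_id": event["id"], "match": shadow == live, "shadow_error": None}
        )
    except Exception as exc:
        # A failing candidate must never affect the live path.
        comparisons.append(
            {"event_id": event["id"], "match": False, "shadow_error": str(exc)}
        )
    return live
```

The comparison log gives the institution evidence for a promotion decision before the new engine is allowed to influence any outcome.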

The same foundation also reduces operational risk. When systems are event-driven and observable, failures can be isolated. If an agent goes offline, work can queue. If an output fails validation, the action can stop. If a deployment fails, traffic can return to a known path.

What leaders should do differently

The lesson from the trough is not to slow down. It is to move the investment from pilots to platforms.

CIOs and engineering leaders should ask four questions of every GenAI initiative.

First, what decision or workflow will this change in production? If the answer is only “productivity”, the initiative may still help, but it is not a strategic platform bet.

Second, what evidence will prove the system acted correctly? If the system cannot reconstruct its decision path, it is not ready for regulated use.

Third, which domain boundaries limit the AI's authority? If the tool can reach across the platform without scoped control, it is a liability.

Fourth, how do we reverse the change? If rollback depends on a heroic manual recovery, the platform is not mature enough.

These questions expose whether a programme is AI-enhanced theatre or real engineering change.

Leaders should also change funding mechanics. Do not fund ten isolated pilots that each invent their own governance pattern. Fund two production paths and the shared platform capabilities they need. The platform approach looks slower in the first quarter. It is faster by the second quarter because every new use case inherits the same controls.

The next phase

The next phase of financial-services AI will be less theatrical and more useful. Full-book rescreening, regulatory narrative generation, code-modernisation support, complaint triage, credit policy testing, and operational exception routing will create more value than another generic assistant.

The institutions that escape the trough will share one trait: they will own the operating fabric around AI. They will treat models as replaceable components inside governed systems. They will measure success through delivery velocity, cost reduction, incident avoidance, audit quality, and system longevity.

The trough is uncomfortable, but it is clarifying. It shows which organisations were buying AI as a symbol and which ones are prepared to engineer it as a capability. Financial services does not need more demos. It needs systems that can explain, reverse, and endure.