
When a Model Upgrade Breaks Production

A Gemini 2.5 Pro upgrade caused a regression in our evidence extraction pipeline. Context adherence dropped. Structured outputs degraded. The benchmarks said it was better. Our production data said otherwise.

Bugni Labs

We run a natural language evidence extraction service that pulls structured findings from fraud investigation reports — dense, multi-page documents where precise language carries legal weight. The pipeline extracts entities, transaction patterns, and evidential statements, returning them as structured JSON with confidence scores and source provenance.

It had been running reliably on gemini-2.5-pro-preview-05-06 for several months.

When Google released the GA version of Gemini 2.5 Pro in June 2025, we upgraded. The benchmarks were better across the board. Latency improved. The upgrade looked routine.

What broke

Within hours, extraction quality shifted. The failure mode was subtle — not errors, but degradation.

Before (preview model): Given a fraud investigation report describing a transaction monitoring alert, the model extracted:

{
  "finding": "Layered transaction pattern consistent with structuring across three accounts",
  "evidence_type": "behavioural_pattern",
  "confidence": 0.94,
  "source_paragraph": 3,
  "entities": ["structuring_pattern", "account_cluster", "transaction_layering"]
}

After (GA model): The same input produced:

{
  "finding": "Suspicious transaction activity detected",
  "evidence_type": "general_alert",
  "confidence": 0.87,
  "source_paragraph": 3,
  "entities": ["transaction", "alert"]
}

Technically valid JSON. Schema-compliant. But the specificity that made the output useful — structuring_pattern, account_cluster, the distinction between behavioural_pattern and general_alert — was lost. The model had generalised where we needed it to be precise.

The confidence score dropped from 0.94 to 0.87, which our threshold checks allowed through. The domain specialists reviewing the outputs caught it within the first batch.
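A static confidence gate illustrates why the degraded output slipped through. This is a minimal sketch, not our actual configuration: the 0.85 cutoff is a hypothetical threshold chosen for illustration.

```python
# Minimal sketch of a confidence gate. THRESHOLD is a hypothetical
# value for illustration; both the preview and GA outputs clear it.
THRESHOLD = 0.85

def passes_gate(extraction: dict) -> bool:
    """Accept an extraction if its confidence meets the threshold."""
    return extraction["confidence"] >= THRESHOLD

preview_output = {"confidence": 0.94}
ga_output = {"confidence": 0.87}

assert passes_gate(preview_output)  # 0.94 clears the gate
assert passes_gate(ga_output)       # 0.87 also clears it: the degradation is invisible here
```

A scalar threshold only sees the number, not the semantics behind it, which is why the human reviewers caught what the gate could not.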

Why it happened

The GA release of Gemini 2.5 Pro optimised for broader reasoning and reduced hallucination across general tasks. But context adherence in domain-specific extraction — our exact use case — regressed. This was subsequently documented in the Google AI developer forums as a known regression pattern in the 2.5 Pro GA release, specifically around "context ignorance and persona adherence."

In practice: the model became better at general reasoning and worse at following precise extraction instructions in dense, domain-specific text. It traded depth for breadth.

What we changed

Pinned model versions. We stopped using auto-upgrading aliases. Our pipeline now references gemini-2.5-pro-preview-05-06 explicitly. Version upgrades are a deliberate engineering decision with a test cycle.
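One way to enforce pinning is a guard at configuration load that rejects floating aliases outright. A minimal sketch, assuming the guard runs at startup; the function name and the alias patterns it checks are illustrative, not an exhaustive list of any provider's aliases.

```python
# Sketch of a startup guard that rejects auto-upgrading model aliases.
# The alias patterns below are illustrative assumptions, not a complete list.
PINNED_MODEL = "gemini-2.5-pro-preview-05-06"
FLOATING_SUFFIXES = ("-latest",)
FLOATING_NAMES = {"gemini-pro-latest"}

def validate_model_name(name: str) -> str:
    """Fail fast if a floating alias sneaks into production configuration."""
    if name in FLOATING_NAMES or name.endswith(FLOATING_SUFFIXES):
        raise ValueError(f"Floating model alias not allowed in production: {name}")
    return name

model = validate_model_name(PINNED_MODEL)  # passes: explicit, dated version
```

Failing at startup turns a silent future upgrade into a loud, reviewable configuration change.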

Domain-specific evaluation harness. We built a regression suite from 200+ real production inputs with verified expected outputs. Every candidate model version runs through this harness. We measure:

  • Entity extraction precision and recall
  • Evidence type classification accuracy
  • Confidence score distribution (mean, variance, drift)
  • Source paragraph attribution accuracy

The harness runs in CI. A model version that passes generic benchmarks but fails domain-specific evaluation does not reach production.
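One harness check, sketched under the assumption that entity comparison is set-based exact match (our real scoring may differ): precision and recall of extracted entities against a verified expected output.

```python
# Sketch of one harness check: entity precision/recall against a verified
# expected output. Set-based exact matching is an illustrative assumption.
def entity_precision_recall(expected: list[str], actual: list[str]) -> tuple[float, float]:
    exp, act = set(expected), set(actual)
    true_pos = len(exp & act)
    precision = true_pos / len(act) if act else 0.0
    recall = true_pos / len(exp) if exp else 0.0
    return precision, recall

# The example from earlier in this post:
expected = ["structuring_pattern", "account_cluster", "transaction_layering"]
ga_actual = ["transaction", "alert"]

precision, recall = entity_precision_recall(expected, ga_actual)
# The GA output shares no entities with the verified expectation, so both
# scores are 0.0 for this input: exactly the failure the harness flags.
```

Aggregated over 200+ inputs, scores like these give a model version a pass/fail verdict that generic benchmarks cannot.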

Canary deployment for model changes. Model version changes now follow the same pattern as code deployments. 10% of traffic goes to the candidate version. We compare output distributions against the baseline for 24 hours before promoting.
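The routing step can be sketched with deterministic hash-based bucketing, so each request consistently hits the same model for the full 24-hour window. The 10% split is from the process above; the hashing scheme and key choice are illustrative assumptions.

```python
import hashlib

# Sketch of deterministic canary routing: a stable ~10% of requests go to
# the candidate model. Hashing the request ID (rather than random choice)
# keeps each request's routing consistent across retries.
CANARY_FRACTION = 0.10

def route_model(request_id: str, baseline: str, candidate: str) -> str:
    """Map the request ID to [0, 1] and route the low bucket to the canary."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = digest[0] / 255.0  # first byte mapped to [0, 1]
    return candidate if bucket < CANARY_FRACTION else baseline

choice = route_model("req-001", "gemini-2.5-pro-preview-05-06", "gemini-2.5-pro")
```

Deterministic bucketing also makes the 24-hour comparison cleaner: the two output distributions come from disjoint, stable request populations.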

Continuous output quality monitoring. We added observability on extraction metrics — not just latency and error rates. Dashboards track:

  • Average entity count per extraction (drop = generalisation)
  • Evidence type distribution (shift toward general_alert = regression)
  • Confidence score distribution (compression = model uncertainty increasing)
  • Field completeness rate (percentage of expected fields populated)

A model that returns valid JSON with degraded content is harder to catch than one that throws errors. These metrics make the invisible visible.
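The batch-level metrics above can be sketched as a single aggregation pass. Field names mirror the JSON examples earlier in this post; the function itself and any alerting thresholds around it are illustrative assumptions, not our production code.

```python
import statistics

# Sketch of batch-level output-quality metrics for the dashboards described
# above. Field names follow the example outputs; the function is illustrative.
def batch_metrics(extractions: list[dict], expected_fields: set[str]) -> dict:
    entity_counts = [len(e.get("entities", [])) for e in extractions]
    confidences = [e["confidence"] for e in extractions]
    completeness = [
        len(expected_fields & e.keys()) / len(expected_fields) for e in extractions
    ]
    return {
        "avg_entity_count": statistics.mean(entity_counts),      # drop = generalisation
        "confidence_mean": statistics.mean(confidences),
        "confidence_stdev": statistics.pstdev(confidences),      # compression = rising uncertainty
        "field_completeness": statistics.mean(completeness),     # share of expected fields populated
    }

batch = [
    {"finding": "Suspicious transaction activity detected",
     "evidence_type": "general_alert", "confidence": 0.87,
     "source_paragraph": 3, "entities": ["transaction", "alert"]},
]
metrics = batch_metrics(batch, {"finding", "evidence_type", "confidence",
                                "source_paragraph", "entities"})
```

Tracked per batch and plotted over time, these numbers turn semantic drift into an alertable signal.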

Practical advice

If you are running AI extraction, generation, or classification in production:

  • Pin your model versions. gemini-pro-latest is a convenience for prototyping. It is a risk in production. The same applies to any provider's auto-upgrading alias.
  • Build domain-specific evaluation. Generic benchmarks measure general capability. Your production pipeline depends on specific capability. The gap between the two is where regressions hide.
  • Treat model changes like code changes. Review. Test. Canary. Promote. The same governance that protects your code should protect your model dependencies.
  • Monitor output quality, not just availability. A 200 OK with semantically degraded content is the hardest failure mode in AI systems — and in a regulated context, the most consequential.

The gap between benchmark and production is where the risk lives.

The model upgrade was an improvement by every published metric. It was a regression for our specific task. That gap is structural — it exists for every foundation model, every upgrade, every provider. The engineering response is to make it observable.

enterprise-ai · model-governance · production-engineering · lessons-learned
