Prompts in Production - Field Note

We stopped treating prompts as content once they started changing production behaviour.

A prompt that routes a case, drafts an evidence summary, or extracts a policy field is part of the system. If it changes, behaviour changes. If behaviour changes, the change needs review, tests, deployment records, and rollback.

The old failure mode

The first versions lived in configuration fields. Engineers edited them directly. Product owners tuned wording after user feedback. The model output improved in one case and regressed in another.

No one could say which prompt version produced which result.

That was the signal to move prompts into the engineering lifecycle.

The new shape

Prompts now live beside code. Each prompt has an owner, a purpose, allowed inputs, expected output schema, and evaluation set.
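As a minimal sketch of what "living beside code" could look like, the metadata listed above can sit in a small typed record next to the prompt text. Every name here (PromptSpec, its fields, the example values) is an illustrative assumption, not a description of a specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """Metadata kept in the repository next to the prompt text (hypothetical shape)."""
    name: str
    owner: str
    purpose: str
    allowed_inputs: tuple[str, ...]   # the only placeholders the template may receive
    output_schema: dict[str, type]    # expected field name -> expected Python type
    evaluation_set: str               # path to the evaluation examples
    template: str

    def render(self, **inputs: str) -> str:
        """Fill the template, rejecting inputs the spec does not declare."""
        unexpected = set(inputs) - set(self.allowed_inputs)
        missing = set(self.allowed_inputs) - set(inputs)
        if unexpected or missing:
            raise ValueError(
                f"unexpected={sorted(unexpected)} missing={sorted(missing)}"
            )
        return self.template.format(**inputs)


# Hypothetical example values.
evidence_summary = PromptSpec(
    name="evidence_summary",
    owner="case-platform-team",
    purpose="Draft an evidence summary for a reviewer",
    allowed_inputs=("case_text",),
    output_schema={"summary": str, "evidence_ids": list},
    evaluation_set="evals/evidence_summary.jsonl",
    template="Summarise the evidence in this case:\n{case_text}",
)
```

Declaring the allowed inputs in the record means an undeclared variable fails loudly at render time instead of silently changing what the model sees.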

A prompt change opens a pull request. The evaluation suite runs against frozen examples and recent anonymised cases. The deployment record links prompt version, model version, test result, and reviewer.
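The deployment record described above can be sketched as a small immutable structure. Deriving the prompt version from a hash of the prompt text is an assumption made for this example; a repository commit or tag would serve the same purpose. The names DeploymentRecord and prompt_version are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def prompt_version(prompt_text: str) -> str:
    """Derive a short, stable version id from the prompt text (assumed scheme)."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class DeploymentRecord:
    """Links the four things the record must tie together."""
    prompt_name: str
    prompt_version: str
    model_version: str
    eval_passed: bool
    reviewer: str


# Hypothetical example values.
text = "Summarise the evidence in this case:\n{case_text}"
record = DeploymentRecord(
    prompt_name="evidence_summary",
    prompt_version=prompt_version(text),
    model_version="example-model-2024-01",
    eval_passed=True,
    reviewer="reviewer@example.com",
)
```

With a content-derived version, any edit to the wording produces a new identifier, so a result can be traced to the exact text that produced it.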

If a production signal drifts, rollback is a version change.
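One way to make rollback a version change is to keep every registered version and treat "active" as a pointer. The sketch below assumes an in-memory registry named PromptRegistry; the class and its methods are hypothetical, and a real system would persist the activation log.

```python
class PromptRegistry:
    """Keeps every prompt version; activation is an append-only log."""

    def __init__(self) -> None:
        self._texts: dict[str, str] = {}
        self.activations: list[str] = []  # ordered trail of activated versions

    def register(self, version: str, text: str) -> None:
        if version in self._texts:
            raise ValueError(f"version already registered: {version}")
        self._texts[version] = text

    def activate(self, version: str) -> None:
        """Point production at a registered version; rollback uses the same call."""
        if version not in self._texts:
            raise KeyError(version)
        self.activations.append(version)

    def active_text(self) -> str:
        if not self.activations:
            raise RuntimeError("no version has been activated")
        return self._texts[self.activations[-1]]


registry = PromptRegistry()
registry.register("v1", "Summarise the evidence.")
registry.register("v2", "Summarise the evidence and cite each item.")
registry.activate("v1")
registry.activate("v2")
registry.activate("v1")  # rollback: an ordinary activation of the older version
```

Because rollback goes through the same call as a release, it leaves the same trail and needs no special procedure.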

The lesson

Prompts are small, but they are not casual. In AI-shaped systems, language is behaviour.

Treating prompts as production artefacts gave us less theatre and more control.

Rejected option

We rejected storing prompts in an admin screen without review. It made iteration easy, but it separated behaviour from the rest of the system.

That separation created a control gap. A prompt could change the output schema, the risk tone, or the evidence threshold while code and tests stayed unchanged.

What we tested

Each prompt gained a small evaluation suite. Some examples were golden cases. Some were edge cases. Some were deliberately awkward inputs that had caused mistakes before.

We tested schema conformance, groundedness, refusal behaviour, confidence shape, and whether the output preserved the evidence fields required by the workflow.
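Two of these checks, schema conformance and preservation of evidence fields, are mechanical enough to sketch. The example assumes the model returns JSON and that each evidence item has a string id; the field names in REQUIRED_FIELDS and the function check_output are illustrative, not the actual schema.

```python
import json

# Hypothetical output schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "summary": str,
    "confidence": (int, float),
    "evidence_ids": list,
}


def check_output(raw: str, source_evidence_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            problems.append(f"wrong type for field: {name}")

    # Every cited evidence id must exist in the input the model was given.
    ids = data.get("evidence_ids")
    if isinstance(ids, list):
        unknown = [
            i for i in ids
            if not isinstance(i, str) or i not in source_evidence_ids
        ]
        if unknown:
            problems.append(f"evidence not in input: {unknown}")
    return problems
```

Run against each frozen example, a check like this turns a wording change that drops a required field into a failing test rather than a production surprise.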

The tests were not perfect. They were enough to stop casual edits from becoming production behaviour.

Production lesson

Prompt governance is software governance with different syntax.

Version the prompt. Test the prompt. Review the prompt. Link the prompt to the model version and the production signal it affects. Anything less turns language into an uncontrolled release path.

The operating rule

The rule we kept was simple: the system should make the accountable path the default path.

That meant no hidden side channel, no manual exception that escaped the evidence record, and no output that could not be replayed later. If a reviewer changed the result, the change became part of the same record. If a threshold moved, the previous cases could be replayed before the change reached production.
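Replaying previous cases before a threshold moves can be as simple as comparing decisions under the old and new values. This sketch assumes each recorded case stores a numeric score and that a case passes when its score meets the threshold; RecordedCase and replay_threshold_change are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedCase:
    case_id: str
    score: float  # score stored in the evidence record at decision time


def replay_threshold_change(
    cases: list[RecordedCase], old_threshold: float, new_threshold: float
) -> list[str]:
    """Return the ids of cases whose decision would flip under the new threshold."""
    return [
        c.case_id
        for c in cases
        if (c.score >= old_threshold) != (c.score >= new_threshold)
    ]


# Hypothetical recorded cases.
history = [
    RecordedCase("case-a", 0.40),
    RecordedCase("case-b", 0.65),
    RecordedCase("case-c", 0.90),
]
flipped = replay_threshold_change(history, old_threshold=0.6, new_threshold=0.7)
```

The list of flipped cases gives the reviewer something concrete to inspect before the change reaches production.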

This added a little ceremony. It removed a larger amount of ambiguity. Engineers knew what evidence the platform expected. Reviewers knew where to look. Operators knew which signal would trigger rollback.

The result was calmer delivery. The team still moved quickly, but each step left a trail strong enough for someone else to inspect weeks later.

We also wrote the failure mode into the runbook. That small step mattered. When the next exception appeared, the team did not have to rediscover the reasoning. They could see the original decision, the rejected alternative, the signal to watch, and the rollback path. That is the level of memory regulated delivery needs.

The practical value came from making the decision visible at the point where work changed hands. Engineers could see the boundary they were protecting. Reviewers could see the evidence they were accepting. Operators could see the rollback path before production pressure arrived. That shared view reduced the amount of trust the process had to borrow from memory. The prompt became reviewable infrastructure, and rollback became ordinary.
