Scale agent-fabric in production · Field Note

Multi-provider, multi-agent: scaling the agent-fabric in production

What changes when you have real traffic, two providers, a second agent on the way, and an LLM bill that's now a line item.

Code: github.com/FintelligenX/agent-fabric

In post 1 we covered the design tenets. In post 2 we ran the stack locally. In post 3 we put it on GCP. This post is about what happens after you've shipped: the operational decisions that emerge when the platform is no longer a thing you're building but a thing you're running.

Three big themes:

Provider lifecycle. Why Gemini is the default, what the Anthropic opt-in looks like, how the provider router actually translates between two incompatible LLM APIs without leaking into the rest of the platform.
Cost as a first-class concern. Real numbers, real trade-offs, real levers. The 47% saving from defaulting to Gemini, the 10× swing within Gemini's own model lineup, the prompt-cache patterns that turn a £400/month bill into £200.
Multi-agent shape. What "add a second agent" looks like in practice: the agent.yaml contract, the tool registry, the boundaries that make this a true platform rather than a single-agent service.

These aren't day-one decisions; they're the decisions that prove the platform was designed correctly. If everything below feels obvious in retrospect, the day-one architecture was right.

The provider router: 100 lines that shouldn't matter

The most interesting code in the whole platform is also the least visible: platform/agent-svc/src/providers/. It's an abstraction that took less than a day to build and made every operational decision since then easier.

The whole thing is one ABC, two implementations, and a translator module:

Code sample (python)

# providers/base.py
from abc import ABC, abstractmethod


class LLMProvider(ABC):
    """Provider contract: every backend returns Anthropic-shaped dicts."""

    @abstractmethod
    async def create_message(
        self, *, model, system, messages, tools, max_tokens,
    ) -> dict:
        """Return Anthropic-shaped:
           {'content': [...], 'stop_reason': ..., 'usage': {...}}"""

One method. Five keyword-only arguments. One return shape: deliberately the Anthropic Messages API shape, because the platform was built with that first and migrating away from a working contract is more expensive than translating into it.

Two implementations live behind that ABC, supported by one translator module:

AnthropicDirectProvider: wraps anthropic.AsyncAnthropic, calls messages.create, serialises the SDK objects into dicts.
GeminiProvider: uses google-genai (google.genai.Client), supports both Vertex AI ADC and Developer API key modes, runs every request through a translator.
_gemini_translate.py: pure functions (to_gemini_contents, to_gemini_tools, to_gemini_system_instruction, from_gemini_response). Zero SDK imports. Unit-tested in isolation.

Dispatch is by model id prefix:

Code sample (python)

# providers/__init__.py
def select_provider(model_id: str) -> LLMProvider:
    if model_id.startswith("claude-"):
        return get_anthropic_provider()
    return get_gemini_provider()

claude-* → Anthropic; anything else → Gemini. The default is Gemini because Gemini's the fallthrough, not because the code special-cases it. Any future provider, Bedrock, Azure OpenAI, or a local llama.cpp endpoint, slots in by adding a prefix and a class. The rest of the platform doesn't need to know.
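To make "adding a prefix and a class" concrete, here is a minimal sketch of a prefix table. It is not the repo's code, which uses the plain if/return shown above. PROVIDER_PREFIXES, DEFAULT_PROVIDER, provider_name_for, and the "bedrock/" prefix are all hypothetical names chosen for illustration.

```python
# Hypothetical prefix table; the repo's select_provider uses a plain if/return.
PROVIDER_PREFIXES: dict[str, str] = {
    "claude-": "anthropic",
    "bedrock/": "bedrock",  # illustrative future provider, not in the repo
}
DEFAULT_PROVIDER = "gemini"  # the fallthrough, not a special case


def provider_name_for(model_id: str) -> str:
    """Return the provider key for a model id, falling through to the default."""
    for prefix, name in PROVIDER_PREFIXES.items():
        if model_id.startswith(prefix):
            return name
    return DEFAULT_PROVIDER
```

With a table like this, a new provider is one dictionary entry plus its class, and unknown ids still land on the default.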

The thing that makes this work is the translator. Anthropic and Gemini diverge in four places that matter and ten places that don't.

Messages → Contents. Anthropic's messages is [{role, content}] where content can be a string, a list of {type: text, text} blocks, or include {type: tool_use, id, name, input} and {type: tool_result, tool_use_id, content}. Gemini's contents is [Content(role, parts)] where role is "user" or "model" (not "assistant") and parts is a list of {text}, {function_call: {name, args}}, or {function_response: {name, response}}. The translator walks the message list once, builds the parts list, and resolves an annoying detail: Gemini's function_response references the called function by name, while Anthropic's tool_result references it by id. The translator maintains an id → name map as it goes, populated from prior tool_use blocks. Single pass, no extra state escapes the function.
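The single-pass walk with the id → name map can be sketched roughly as follows. This is an illustration on plain dicts, not the repo's to_gemini_contents. In particular, wrapping the tool result as {"result": ...} is an assumption about how the response payload is shaped.

```python
def to_gemini_contents(messages: list[dict]) -> list[dict]:
    """Translate Anthropic-shaped messages into Gemini-shaped contents (plain dicts)."""
    id_to_name: dict[str, str] = {}  # tool_use id -> function name; never leaves this call
    contents = []
    for msg in messages:
        role = "model" if msg["role"] == "assistant" else "user"
        blocks = msg["content"]
        if isinstance(blocks, str):  # bare-string content becomes one text block
            blocks = [{"type": "text", "text": blocks}]
        parts = []
        for block in blocks:
            if block["type"] == "text":
                parts.append({"text": block["text"]})
            elif block["type"] == "tool_use":
                id_to_name[block["id"]] = block["name"]  # remember for the later result
                parts.append(
                    {"function_call": {"name": block["name"], "args": block["input"]}}
                )
            elif block["type"] == "tool_result":
                # Raises KeyError if the result has no prior tool_use in this list.
                name = id_to_name[block["tool_use_id"]]
                parts.append(
                    {
                        "function_response": {
                            "name": name,
                            # Assumed wrapper: the payload shape is not given in the text.
                            "response": {"result": block["content"]},
                        }
                    }
                )
        contents.append({"role": role, "parts": parts})
    return contents
```

Because the map is a local variable, two concurrent translations cannot see each other's ids, which is the "no extra state escapes" property.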

System prompt → system_instruction. Anthropic supports a system prompt as a list of blocks with cache_control annotations for prompt caching. Gemini takes a plain system_instruction string and caches implicitly. The translator joins the Anthropic block list with newlines, drops the cache_control hints (Gemini ignores them anyway), and passes the result. Cache behaviour is preserved across providers because both ends do prompt caching for prefixes ≥ a threshold; the metadata is just no longer needed.
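A minimal sketch of the join, assuming the system prompt arrives either as a plain string or as a list of Anthropic text blocks. The function name matches the one listed earlier, but the body is an illustration rather than the repo's implementation.

```python
def to_gemini_system_instruction(system) -> str:
    """Flatten an Anthropic system prompt into one plain string.

    Accepts a str or a list of {"type": "text", "text": ..., "cache_control": ...}
    blocks; cache_control is dropped simply by not copying it.
    """
    if isinstance(system, str):
        return system
    return "\n".join(
        block["text"] for block in system if block.get("type") == "text"
    )
```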

Tools → function_declarations. Anthropic's tool shape {name, description, input_schema} maps almost 1:1 to Gemini's FunctionDeclaration shape {name, description, parameters}. The only thing the translator does is rename input_schema to parameters and wrap the list in a Gemini Tool(function_declarations=[...]) envelope.
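The rename-and-wrap step is small enough to show in full. This sketch works on plain dicts instead of the SDK's Tool and FunctionDeclaration types, and returning an empty list when no tools are passed is an assumption.

```python
def to_gemini_tools(tools: list[dict]) -> list[dict]:
    """Rename input_schema -> parameters and wrap in one function_declarations envelope."""
    if not tools:
        return []  # assumed behaviour when the agent declares no tools
    declarations = [
        {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["input_schema"],
        }
        for tool in tools
    ]
    return [{"function_declarations": declarations}]
```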

Response → response. Gemini returns a GenerateContentResponse with candidates[0].content.parts. The translator pulls out text parts and function_call parts and rebuilds the Anthropic content list. Two subtle pieces:

IDs are synthesised. Gemini doesn't return stable tool-call IDs. The translator generates toolu_<random> on emit so the downstream tool-result block can reference it, and the same id-to-name map handles the next round trip.
Usage tokens are reconciled. Gemini's prompt_token_count includes the cached portion; Anthropic's input_tokens excludes it. The translator does the subtraction: input_tokens = prompt_token_count - cached_content_token_count, with cached_content_token_count becoming cache_read_input_tokens. Without this, every cache hit would double-charge the quota counter.

This last bullet is the kind of detail that gets discovered when the bill arrives. A naive translation that maps prompt_token_count → input_tokens directly would silently inflate the usage-tracking by 50–80% on any conversation with a stable system prompt. The platform's whole point of charging cache-weighted tokens is to give users a stable, predictable rate that mirrors the actual provider bill. Botching this would break the trust the per-identity quota depends on.
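The two subtle pieces can be sketched as below, using plain dicts in place of the SDK's usage_metadata object. reconcile_usage and new_tool_use_id are hypothetical helper names. Setting cache_creation_input_tokens to 0 for Gemini and the 24-character id suffix are both assumptions.

```python
import uuid


def reconcile_usage(usage_metadata: dict) -> dict:
    """Map Gemini token counts onto Anthropic-shaped usage without double counting."""
    prompt = usage_metadata.get("prompt_token_count") or 0
    cached = usage_metadata.get("cached_content_token_count") or 0
    return {
        "input_tokens": prompt - cached,      # uncached portion only
        "cache_read_input_tokens": cached,    # cached portion reported separately
        "cache_creation_input_tokens": 0,     # assumption: nothing reported for Gemini
        "output_tokens": usage_metadata.get("candidates_token_count") or 0,
    }


def new_tool_use_id() -> str:
    """Synthesise an Anthropic-style tool-call id for a Gemini function_call."""
    return f"toolu_{uuid.uuid4().hex[:24]}"
```

For a 1,500-token prompt of which 1,000 were cached, this reports 500 input tokens and 1,000 cache-read tokens, where the naive mapping would have reported 1,500 input tokens.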

The Gemini-default decision

When the platform started, Anthropic was the only LLM provider. Switching the default to Gemini was a deliberate operational choice with three drivers:

Cost. Gemini 2.5 Flash at $0.30/MTok input + $2.50/MTok output is cheaper than Anthropic Haiku 4.5 at $0.80 + $4.00. At the platform's typical request size (1500 input + 500 output tokens), that's $0.0017 per request on Gemini vs $0.0032 on Haiku. At 150,000 requests/month, a not-unrealistic production volume, the LLM bill is £201 vs £378. The 47% saving isn't an optimisation; it's the default.
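The arithmetic behind those figures, reproduced as a short script so the inputs are easy to change. The rates and request shape are the ones quoted above. The sterling figures in the text imply an exchange rate of roughly £0.79 to the dollar, which is an inference from the numbers and not something the post states, so the script stays in dollars.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float) -> float:
    """Dollar cost of one request, with rates given in $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


gemini = cost_per_request(1500, 500, input_rate=0.30, output_rate=2.50)
haiku = cost_per_request(1500, 500, input_rate=0.80, output_rate=4.00)
monthly_requests = 150_000
saving = 1 - gemini / haiku

print(f"Gemini: ${gemini:.4f}/request, ${gemini * monthly_requests:.0f}/month")
print(f"Haiku:  ${haiku:.4f}/request, ${haiku * monthly_requests:.0f}/month")
print(f"Saving: {saving:.0%}")
```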

Auth. Gemini on Vertex AI works through the agent service account's ADC token. There's no API key to seed, no Secret Manager binding to provision, no key-rotation workbook. A fresh GCP project deploys to a working agent without any external signups. Anthropic still requires an account, a key, and the rotation policy that goes with it.

Networking. Vertex AI calls go through Private Google Access: they never traverse the public internet and never need Cloud NAT. Anthropic API calls go out via Cloud NAT, which is a fixed £25/month line item per environment regardless of usage. A Gemini-only deployment can drop NAT entirely (post 3 covers this).

But Anthropic stays available as an opt-in, gated behind a single tfvar:

Config snippet (hcl)

# terraform/envs/stage/terraform.tfvars
anthropic_secret_name = "anthropic-api-key"   # non-empty turns it on

The Terraform module uses count on the Anthropic resources:

Config snippet (hcl)

resource "google_secret_manager_secret" "anthropic" {
  count     = var.anthropic_secret_name == "" ? 0 : 1
  secret_id = var.anthropic_secret_name
  ...
}

Empty tfvar → no secret, no IAM binding, no env var on agent-svc (technically the env var is set but to empty, which the Python side treats as "not configured"). Non-empty → Terraform provisions the stub, you seed the value, callers can request claude-* model ids.

The user-facing behaviour on a Gemini-only deployment is precise: a claude-haiku-4-5-20251001 request returns 503 provider_not_configured with a clear error message. Not a 500. Not a 400. A 503, with "this deployment doesn't currently support this provider, here's why", is the right semantic answer.
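A rough sketch of the check behind that response, assuming the key reaches the Python side as a possibly empty string. ProviderNotConfigured and check_provider_available are hypothetical names. How the repo actually maps the condition to an HTTP response is not shown in this post.

```python
from typing import Optional


class ProviderNotConfigured(Exception):
    """Hypothetical error type, intended to surface as 503 provider_not_configured."""

    status_code = 503
    error_code = "provider_not_configured"


def check_provider_available(model_id: str, anthropic_api_key: Optional[str]) -> None:
    """Raise if a claude-* model is requested but no Anthropic key is configured.

    An empty string counts as "not configured", matching the empty-env-var case.
    """
    if model_id.startswith("claude-") and not anthropic_api_key:
        raise ProviderNotConfigured(
            f"model {model_id!r} needs the Anthropic provider, "
            "which is not enabled on this deployment"
        )
```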

Auto-derived summary models: a small thing that matters

When a conversation gets long, agent-svc compresses old turns into a <prior_context> block prepended to the most recent kept user message. The compression itself is another LLM call, made against a small/cheap model. In a single-provider world, you pick the model once and forget. In a multi-provider world, you have a choice to make every time.

The early version of the platform always summarised with Haiku, regardless of which model was driving the conversation. This worked but had an awkward consequence: a Gemini-only deployment still needed an Anthropic key, just for summarisation. That broke the "GCP-only platform" promise for any conversation long enough to trigger compression.

The fix is small and obvious in retrospect:

Code sample (python)

def _summary_model_for(chat_model: str) -> str:
    if _SUMMARY_MODEL_OVERRIDE:
        return _SUMMARY_MODEL_OVERRIDE      # global env override wins
    if chat_model.startswith("claude-"):
        return _SUMMARY_MODEL_ANTHROPIC     # default: claude-haiku-4-5-20251001
    return _SUMMARY_MODEL_GEMINI            # default: gemini-2.5-flash

Chat on Claude → summarise on Claude. Chat on Gemini → summarise on Gemini. A Gemini-only deployment never reaches across the provider boundary. A multi-provider deployment naturally pays each provider's small model rate for its own conversation's compression.

This is the pattern this kind of platform engineering rewards: wherever there's a "we'll just use Claude for that" hidden somewhere in the code, find it and parameterise it. Every one of those is a future "wait, why do I need an Anthropic key for this Gemini-only deployment?" question.

Cost levers: where the bill actually comes from

Stage costs about £85/month. Production at 5,000 requests/day costs about £340/month. The breakdown is more interesting than the totals:

Line item	Prod (£/mo)	Why
LLM: Gemini 2.5 Flash default	~£201	150k requests at 1.5k in + 500 out
Memorystore Redis (STANDARD_HA)	~£83	HA replica, required for rate-limit + quota accuracy
Global Load Balancer	~£14	Fixed forwarding-rule + backend cost
Cloud NAT	~£25	Only needed for Anthropic; eliminable on Gemini-only
Cloud Run (gateway + auth + agent + web)	~£15	Min-instances=1 each, mostly memory
Everything else	~£2	Cloud DNS, Secret Manager, logging, monitoring

The LLM line is the biggest. Three levers move it.

Provider choice. Default to Gemini Flash. Done. Saves ~£177/month vs Haiku at the same volume. There is no smarter optimisation than starting with the cheaper provider.

Prompt caching. Anthropic's prompt cache reads bill at ~10% of input rate. Gemini's implicit cache reads bill at 0%. The platform's system-prompt block is marked with cache_control: {type: "ephemeral"} on the Anthropic side, which doesn't help Gemini (which caches implicitly anyway) but cuts the Anthropic input cost by 80–90% on the static prefix once the cache is warm. The <prior_context> compression block gets the same treatment, so subsequent turns in the same compressed conversation also cache-hit.

The cost-weighted token-budget math is wired to know this. When the gateway records usage:

Code sample (python)

# From gateway-svc/src/models_config.py
charged_input = (
    usage["input_tokens"]
    + usage["cache_read_input_tokens"]  * (cache_read_rate / input_rate)
    + usage["cache_creation_input_tokens"] * (cache_write_rate / input_rate)
)

For Haiku 4.5 that's input + 0.1 × cache_read + 1.25 × cache_creation. For Gemini Flash it's input + 0 × cache_read + 0.1 × cache_creation. The per-identity quota counts these weighted figures, so users see a stable rate that matches what they're actually charged.
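A worked instance of the weighting, assuming a request with 500 uncached input tokens and 1,000 cache-read tokens. The per-MTok cache rates below are derived from the multipliers quoted above (0.1× and 1.25× of $0.80 for Haiku, 0× and 0.1× of $0.30 for Gemini Flash), not taken from a price list.

```python
def charged_input(usage: dict, input_rate: float,
                  cache_read_rate: float, cache_write_rate: float) -> float:
    """Cache-weighted input tokens, mirroring the gateway formula quoted above."""
    return (
        usage["input_tokens"]
        + usage["cache_read_input_tokens"] * (cache_read_rate / input_rate)
        + usage["cache_creation_input_tokens"] * (cache_write_rate / input_rate)
    )


usage = {
    "input_tokens": 500,
    "cache_read_input_tokens": 1000,
    "cache_creation_input_tokens": 0,
}
haiku_charge = charged_input(usage, input_rate=0.80, cache_read_rate=0.08, cache_write_rate=1.00)
gemini_charge = charged_input(usage, input_rate=0.30, cache_read_rate=0.0, cache_write_rate=0.03)
print(round(haiku_charge), round(gemini_charge))
```

The same 1,500 raw prompt tokens count as about 600 weighted tokens on Haiku and 500 on Gemini Flash.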

The two providers price caching very differently, and the multipliers are misleading without that context. Anthropic charges per cache-read token: every read is billed at ~10% of the input rate, but there is no extra charge to hold the cache between calls (storage within the 5-minute TTL is free). Gemini on Vertex flips it: cache reads carry no per-token charge (hence the 0× read multiplier), but you pay a per-MTok-per-hour storage fee while the cache exists. That storage cost isn't a per-request token charge, so it isn't in the formula above; it lands on the bill as a separate Vertex line item. Practical consequence: on Anthropic, every cache hit cuts the request cost. On Gemini, a cache hit cuts the per-token cost of the cached prefix to zero, but you've already paid for the time the cache sat in memory, so cached prefixes only win when re-reads are dense relative to the storage window.

Context compression. A conversation that crosses the CONTEXT_COMPRESSION_THRESHOLD (default 2048 estimated tokens) triggers a summary call before the next main call. The summary call itself costs tokens, about $0.002 per compression event on Gemini Flash and $0.005 on Haiku, but it cuts the next main call's input cost dramatically. For long-running chat conversations the math is heavily in favour of compression. For one-shot requests it never fires.
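The trigger itself can be sketched as a simple threshold check. The post does not say how tokens are estimated, so the four-characters-per-token heuristic here is an assumption, as are the helper names. The sketch also assumes string message content.

```python
CONTEXT_COMPRESSION_THRESHOLD = 2048  # default from the text, in estimated tokens


def estimate_tokens(messages: list[dict]) -> int:
    """Rough estimate at ~4 characters per token (assumed, not the repo's estimator)."""
    return sum(len(m["content"]) for m in messages) // 4


def should_compress(messages: list[dict],
                    threshold: int = CONTEXT_COMPRESSION_THRESHOLD) -> bool:
    """True once the estimated conversation size exceeds the threshold."""
    return estimate_tokens(messages) > threshold
```

A one-shot request stays far below the threshold, which is why compression never fires for it.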

Beyond the LLM line, the next big optimisation is dropping Cloud NAT for a Gemini-only deployment. £25/month per environment, two environments, that's £50/month for the price of one tfvars change and a few minutes of Terraform validation. The Anthropic path's worth on the deployment has to clear that bar to keep NAT.

The SPA as a cost-control surface

The web UI is a surprisingly important cost surface because it's where ad-hoc model choices happen. Three deliberate UX decisions came out of running the platform for a few months:

New chats reset to the platform default. When you click "+ New chat", the new conversation starts on whatever the server's default_model is (currently gemini-2.5-flash), not on whatever you picked last. This removes a common source of cost overruns. With a sticky default, a user who picks Opus 4.7 ($30/MTok combined) for one analysis and then forgets would start every subsequent chat on Opus. Forcing the reset means accidentally expensive selections don't persist past the conversation that needed them.

For users who legitimately want stickiness, say, an engineer doing five Opus chats in a row, there's a per-identity localStorage toggle: Remember model. Off by default. Opt-in. The opt-in framing means the cheap-default behaviour is what new users get; stickiness is a deliberate choice.

Models are sorted cheapest-first. The /v1/models endpoint, which the SPA's model dropdown reads, returns models in the order they appear in config/models_config.json. The build script (scripts/build-models-config.sh) sorts each provider block by per-MTok cost. Gemini block first (cheapest provider), Anthropic block second; within each block, the lite model ahead of the flash model ahead of the pro model. A user scanning the dropdown from the top reads the cheapest options first, in their primary path of attention. The expensive options are visible but require deliberate scrolling.
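The ordering rule, sketched in Python for illustration (the actual build step is a shell script). The field names and the "example-gemini-pro" entry with its prices are hypothetical. The other two entries use the rates quoted earlier in the post.

```python
PROVIDER_ORDER = ("gemini", "anthropic")  # cheapest provider block first


def sort_cheapest_first(models: list[dict]) -> list[dict]:
    """Order by provider block, then by combined per-MTok cost within the block."""
    return sorted(
        models,
        key=lambda m: (
            PROVIDER_ORDER.index(m["provider"]),
            m["input_per_mtok"] + m["output_per_mtok"],
        ),
    )


models = [
    {"id": "claude-haiku-4-5-20251001", "provider": "anthropic",
     "input_per_mtok": 0.80, "output_per_mtok": 4.00},
    {"id": "example-gemini-pro", "provider": "gemini",   # illustrative entry and prices
     "input_per_mtok": 2.00, "output_per_mtok": 8.00},
    {"id": "gemini-2.5-flash", "provider": "gemini",
     "input_per_mtok": 0.30, "output_per_mtok": 2.50},
]
ordered_ids = [m["id"] for m in sort_cheapest_first(models)]
print(ordered_ids)
```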

The usage block is opt-in. The gateway strips usage from the response by default; callers add "usage": true to the request body to get it back. This isn't a cost-control mechanism per se, but it shapes the developer experience: users who care about token accounting opt in and see real numbers. Users who don't, don't. The audit story is always available via /v1/usage regardless.
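The opt-in can be sketched as a small response-shaping step. shape_response is a hypothetical name, and treating only a literal true as opting in is an assumption about how the flag is parsed.

```python
def shape_response(request_body: dict, upstream_response: dict) -> dict:
    """Strip the usage block unless the caller sent "usage": true."""
    if request_body.get("usage") is True:
        return upstream_response
    return {k: v for k, v in upstream_response.items() if k != "usage"}
```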

These three things together are not what you'd think of as "platform engineering": they're frontend choices. But each of them encodes a cost trade-off, and each is reversible in one PR. That's the right place for cost knobs to live: as close to the user as possible.

Adding a second agent

The platform's domain-agnostic shape is the part nobody believes until they actually do it. The promise: a new agent is an agent.yaml, a system prompt, an optional list of tools, and one Terraform module instantiation. Zero platform-code changes.

The proof is the lifecycle of a hypothetical second agent (call it acme-agent; the name doesn't matter):

1. Create the directory. agents/acme-agent/agent.yaml, agents/acme-agent/system-prompt.txt, agents/acme-agent/tools/. The contents look exactly like agents/infra-agent/ but with the new domain's config.

2. Define the tools (if any). Each tool is a Python file with a run() function and an INPUT_SCHEMA dict. The platform's tool registry imports the modules dynamically at startup from paths declared in agent.yaml:

Config snippet (yaml)

tools:
  - name: search
    module: tools/search.py
    service_account: acme-agent-search-sa
    iam_role: roles/cloudtrace.viewer  # whatever the tool actually needs
    description: "Search the acme domain by keyword..."

Each tool has its own service account with the minimum IAM the tool needs. The "read-only by IAM" tenet from post 1 lives at the tool-account level. A write-capable tool would have to be granted write permissions explicitly, and that grant would be visible in the Terraform PR.
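A minimal sketch of what dynamic loading of such a tool module could look like, using the standard library's importlib. load_tool and the returned dict shape are hypothetical. The only contract taken from the text is that each module exposes a run() function and an INPUT_SCHEMA dict.

```python
import importlib.util
from pathlib import Path


def load_tool(name: str, module_path: Path) -> dict:
    """Import a tool module from a file path and validate its contract."""
    spec = importlib.util.spec_from_file_location(f"agent_tool_{name}", module_path)
    if spec is None or spec.loader is None:
        raise ValueError(f"cannot load tool module at {module_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the tool file's top-level code

    run = getattr(module, "run", None)
    schema = getattr(module, "INPUT_SCHEMA", None)
    if not callable(run) or not isinstance(schema, dict):
        raise ValueError(f"tool {name!r} must define run() and an INPUT_SCHEMA dict")
    return {"name": name, "run": run, "input_schema": schema}
```

Failing at startup when a module lacks run() or INPUT_SCHEMA keeps a misconfigured agent.yaml from surfacing as a runtime error mid-conversation.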

3. Add the Terraform module instantiation. In terraform/envs/stage/main.tf:

Config snippet (hcl)

module "acme_agent" {
  source = "../../modules/agent-domain"

  project_id            = module.project.project_id
  region                = var.region
  agent_name            = "acme-agent"
  agent_config          = yamldecode(file("../../../agents/acme-agent/agent.yaml"))
  agent_config_raw      = file("../../../agents/acme-agent/agent.yaml")
  system_prompt_content = file("../../../agents/acme-agent/system-prompt.txt")
  agent_image           = data.google_artifact_registry_docker_image.agent.self_link
  gateway_sa_email      = module.platform.gateway_sa_email
  ...
}

Same module the infra-agent uses. Same image (because agent-svc is generic: it serves whichever agent its AGENT_CONFIG_PATH points to). Different env vars per Cloud Run service so each one mounts its own agent.yaml. Apply.

4. Register the agent with the gateway. gateway-svc routes by an agent field in the request, looking up the matching agent-svc URL in a registry. Add the new URL:

Config snippet (hcl)

# terraform/envs/stage/main.tf
locals {
  agent_registry = {
    "infra-agent" = module.infra_agent.agent_url
    "acme-agent"  = module.acme_agent.agent_url
  }
}

module "gateway" {
  ...
  agent_registry_json = jsonencode(local.agent_registry)
}

Apply again. The gateway re-reads its registry on the next revision rollout and starts forwarding "agent": "acme-agent" requests to the new service.
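On the gateway side, the lookup might look something like this. It is a sketch only: the function names are hypothetical, the JSON is passed in as a string because the post doesn't name the environment variable that carries it, and the URLs in the usage lines are placeholders.

```python
import json


def load_agent_registry(raw_json: str) -> dict[str, str]:
    """Parse the JSON-encoded registry (agent name -> agent-svc URL)."""
    registry = json.loads(raw_json)
    if not isinstance(registry, dict):
        raise ValueError("agent registry must be a JSON object")
    return registry


def resolve_agent_url(registry: dict[str, str], request_body: dict) -> str:
    """Return the agent-svc URL for the request's "agent" field."""
    agent = request_body.get("agent")
    if agent not in registry:
        raise KeyError(f"unknown agent: {agent!r}")
    return registry[agent]


registry = load_agent_registry(
    '{"infra-agent": "https://infra-agent.example", "acme-agent": "https://acme-agent.example"}'
)
print(resolve_agent_url(registry, {"agent": "acme-agent"}))
```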

5. Done. No platform/agent-svc/src/ change. No platform/gateway-svc/src/ change. The new agent has its own service account, its own Secret Manager stubs (system prompt + optional Anthropic key), its own Cloud Run service, its own quota counter (each agent's identity is tracked independently). The /v1/chat endpoint now serves two agents; the SPA's agent dropdown surfaces both; the /v1/usage view sums across them or filters by agent.

The whole exercise is engineering: write the tools, tune the prompt, debug the IAM. None of it is platform architecture. That's the test the platform passes: it still has the shape the day-1 design said it would.

Day-two patterns that emerge

Three things you only notice when the platform has been running for a while:

Secret rotation is invisible. Both infra-agent-system-prompt and the Anthropic key are polled every 60 seconds. gcloud secrets versions add is the whole rotation procedure. Cloud Run revisions never roll. The audit log shows exactly when the version changed and who pushed it. Compare to "edit env var → trigger CI → wait for build → wait for Cloud Run revision → drain old": the polling pattern makes the rotation almost free.
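The polling idea can be approximated with a refresh-on-read cache, sketched below. This is a simplification: it re-fetches lazily when a read finds the value older than 60 seconds, whereas the platform may well use a background poller. PolledSecret is a hypothetical class, and the fetch callable stands in for whatever actually reads the latest secret version.

```python
import time
from typing import Callable


class PolledSecret:
    """Cache a secret value and re-fetch it once the refresh interval has elapsed."""

    def __init__(self, fetch: Callable[[], str], interval_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._interval_s = interval_s
        self._clock = clock
        self._value = fetch()          # initial load at startup
        self._fetched_at = clock()

    def get(self) -> str:
        """Return the cached value, refreshing first if it is older than the interval."""
        now = self._clock()
        if now - self._fetched_at >= self._interval_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A new secret version becomes visible to the service within one interval, with no revision rollout involved.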

The token-budget alerts are the early warning system. Three thresholds, 75%, 90%, 95%, emit Cloud Monitoring custom metrics with the identity as a label. The first time someone wires up a Slack alert on the 75% breach, they discover the identity that's running 5× the expected daily volume. Usually it's a script with no exponential backoff. Sometimes it's a real upgrade in usage. Either way, the platform tells you before the cost report does.
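One way to decide when to emit those metrics is to report only the thresholds a usage update has just crossed, so each one fires once per budget period. The sketch below shows that check. crossed_thresholds is a hypothetical helper, and the actual metric emission is out of scope here.

```python
ALERT_THRESHOLDS = (0.75, 0.90, 0.95)  # fractions of the token budget, from the text


def crossed_thresholds(used_before: float, used_after: float, budget: float) -> list[float]:
    """Return the thresholds crossed by moving from used_before to used_after."""
    return [
        t for t in ALERT_THRESHOLDS
        if used_before < t * budget <= used_after
    ]
```

For example, an update that takes an identity from 700 to 920 tokens of a 1,000-token budget crosses both the 75% and 90% thresholds, while a later move from 920 to 930 crosses none.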

The auto-title call is the operational smell test. Every successful chat triggers a small auto-title API call against the same provider as the main request. That call uses the same identity, the same quota, the same code path. If your platform breaks in any subtle way, the auto-title is usually the first thing that fails: it's a tiny request that exercises everything. When auto-titles stop appearing in the SPA sidebar, look at logs before the user reports anything.

What the platform is and isn't

After four posts, an explicit framing of what agent-fabric is:

A small, opinionated platform: three platform services, one optional SPA, a few hundred lines of provider-router code, a few thousand lines of Terraform.
Built for the case where you'll ship multiple agents, want one set of guardrails to apply to all of them, and don't want to fork the platform every time the domain changes.
Designed around the integration patterns real teams need (human, service, pipeline, agent-to-agent, partner) rather than the one happy path a prototype gets away with.
Cost-aware by default, with Gemini-on-GCP as the cheap path and Anthropic as the upgrade.

It is not:

A model. The platform doesn't train anything; it serves whatever the LLM providers ship.
A vector database. Vertex AI Vector Search handles retrieval; the platform handles the request lifecycle around it.
A workflow engine. Tool-use loops are bounded at five iterations by design. Long-running orchestration belongs in Workflows, Step Functions, or Temporal, not inside the agent.
A SaaS. There's no multi-tenant hosting model. Each deployment is one customer; cost lives where the GCP project does.

Five-line elevator pitch: agent-fabric is the platform you wish you'd built before your prototype shipped. It enforces zero-trust between services, layers guardrails in code and at IAM, factors the domain into agent.yaml, caps the worst-case request before optimising the average, and defaults to the cheap LLM. New agents ship without forking. Old agents keep working when the LLM landscape moves.

The rest is engineering.

That's the series. For the full architecture reference, see docs/architecture.md. For the day-to-day operational runbook, docs/deploy.md. For the project's open work items, docs/todo.md.

Or jump straight to the repo, github.com/FintelligenX/agent-fabric, and try post 2: five commands, ten minutes, a working multi-provider chat on localhost.
