Kalman AI

The myth of the one-model stack

It is tempting to assume a production AI system has a single model behind it — pick the best frontier model, point your application at its API, and ship. That assumption survived contact with reality for about a year.

Modern production stacks are hybrid by default. A frontier model handles complex reasoning. A smaller, cheaper model does the routine work. An open-weight model runs locally for sensitive data. An in-house fine-tune covers a specific repeatable task.

This is not architectural over-engineering. It is what happens when you take the actual constraints of a business — cost, latency, sensitivity, capability, vendor risk — seriously.

Why no single model wins

Different models are good at different things, at different prices, with different deployment options.

Capability isn't uniform. A frontier model might be the only one that can plan a complex workflow reliably. The same model is overkill for routine classification.

Cost varies by orders of magnitude. Frontier API calls can be 50–100× the cost of a small model on the same input.

Latency matters. A 4-second frontier call is fine for a research synthesis step. It is fatal in a real-time agent loop.

Sensitivity dictates deployment. Some data legally cannot leave the customer's perimeter, so externally hosted frontier APIs are not an option for it.

Vendor risk is real. Single-provider stacks are exposed to outages, deprecations, pricing changes, and policy shifts.

What the orchestration layer actually does

A multi-model orchestration layer answers four questions for every request: capability (which model is good enough?), sensitivity (can this data leave the perimeter?), latency (how fast must the response be?), and cost (what are the unit economics?).

It then turns those answers into a routing decision and abstracts the choice away from the application. Application code says "do this kind of task on this kind of data" and the orchestration layer decides which model handles it.
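As a rough illustration of how the four answers can become a routing decision, the sketch below filters a small model catalogue by capability, sensitivity, latency, and cost, then picks the cheapest survivor. Everything in it is an assumption made for the example: the names Request, ModelProfile, CATALOG, and route, the model labels, and the latency and cost figures are invented placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    task: str             # hypothetical task label, e.g. "classify"
    sensitive: bool       # True if the data must stay inside the perimeter
    max_latency_ms: int   # latency budget for this request
    max_cost_usd: float   # cost budget for this request


@dataclass(frozen=True)
class ModelProfile:
    name: str
    tasks: frozenset          # tasks this model is considered good enough for
    in_perimeter: bool        # deployed inside the customer's perimeter?
    typical_latency_ms: int   # illustrative figure, not a benchmark
    est_cost_usd: float       # illustrative per-request cost


# Placeholder catalogue; all numbers are invented for the example.
CATALOG = [
    ModelProfile("frontier-hosted",
                 frozenset({"plan_workflow", "classify", "summarise"}),
                 False, 4000, 0.05),
    ModelProfile("small-hosted",
                 frozenset({"classify", "summarise"}),
                 False, 400, 0.001),
    ModelProfile("open-weight-local",
                 frozenset({"classify", "summarise"}),
                 True, 900, 0.002),
]


def route(req, catalog=CATALOG):
    """Return the cheapest model that satisfies all four constraints."""
    candidates = [
        m for m in catalog
        if req.task in m.tasks                            # capability
        and (m.in_perimeter or not req.sensitive)         # sensitivity
        and m.typical_latency_ms <= req.max_latency_ms    # latency
        and m.est_cost_usd <= req.max_cost_usd            # cost
    ]
    if not candidates:
        raise LookupError(f"no model satisfies the constraints for {req.task!r}")
    return min(candidates, key=lambda m: m.est_cost_usd)


chosen = route(Request("classify", sensitive=True,
                       max_latency_ms=1000, max_cost_usd=0.01))
print(chosen.name)
```

Raising an error when nothing qualifies is one possible design choice; a real system might instead relax a constraint or queue the request.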

Routing — the hard part

Routing sounds simple — pick the right model — and is, in practice, where most of the design effort lives. Capability-based routing maps task type to model. Sensitivity-based routing maps data classification to deployment surface. Latency-based routing respects SLOs. Cost-based routing respects budgets. Capability fallback chains keep the system running when a provider degrades.
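A capability fallback chain can be sketched as an ordered list of providers tried in turn. The exception type ProviderUnavailable, the function call_with_fallback, and the two stub providers are hypothetical names for this example; a real chain would also need timeouts, retry limits, and a check that the fallback model is actually capable of the task.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error raised when a provider is degraded or down."""


def call_with_fallback(chain, prompt):
    """Try each (name, callable) pair in order; return the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")   # remember why this provider failed
    raise ProviderUnavailable("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real clients.
def flaky_primary(prompt):
    raise ProviderUnavailable("simulated outage")


def steady_secondary(prompt):
    return f"echo: {prompt}"


chain = [("primary", flaky_primary), ("secondary", steady_secondary)]
served_by, reply = call_with_fallback(chain, "hello")
print(served_by, reply)
```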

The router itself is rarely a model. It is usually a small policy engine: deterministic rules first, occasional model-based classification for the cases the rules can't decide.
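A minimal sketch of such a policy engine follows, assuming requests arrive as plain dictionaries. The rule functions, the target labels ("local", "small", "frontier"), and stub_classifier are all hypothetical; the classifier here is a keyword check standing in for an occasional model-based classification call.

```python
from typing import Callable, Optional

# A rule inspects the request and returns a target, or None if it cannot decide.
Rule = Callable[[dict], Optional[str]]


def rule_sensitive(req):
    # Sensitive data always goes to the in-perimeter deployment.
    return "local" if req.get("sensitivity") == "high" else None


def rule_known_task(req):
    # Known task types map directly to a model tier.
    return {"classify": "small", "plan": "frontier"}.get(req.get("task"))


def stub_classifier(req):
    # Placeholder for a model-based classifier; here just a keyword heuristic.
    return "frontier" if "plan" in req.get("text", "") else "small"


def make_router(rules, classifier):
    def router(req):
        for rule in rules:                    # deterministic rules first
            target = rule(req)
            if target is not None:
                return target, rule.__name__
        return classifier(req), "classifier"  # fall back only when rules abstain
    return router


router = make_router([rule_sensitive, rule_known_task], stub_classifier)
print(router({"task": "classify", "sensitivity": "high"}))
print(router({"task": "unknown", "text": "please plan a multi-step migration"}))
```

Returning the name of the rule that fired alongside the target makes each routing decision easy to log and audit.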

Provider abstraction

Every provider's API has a slightly different shape. Token counting differs. Streaming protocols differ. Tool-calling formats differ. Error semantics differ. A provider-abstraction layer normalises all of this. The application sees a single internal API (complete(), embed(), rerank(), call_tool()) and the abstraction layer translates it to whichever provider is on the other end.
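One way to sketch the abstraction is a shared interface plus one adapter per provider. The example covers only complete(); the Completion type, the two adapters, and the raw response shapes are invented for illustration and do not correspond to any real vendor's SDK. Token counts are faked with a whitespace split.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Completion:
    """Normalised result that the application sees, whatever the provider."""
    text: str
    input_tokens: int
    output_tokens: int


class Provider(Protocol):
    def complete(self, prompt: str) -> Completion: ...


class VendorAAdapter:
    """Adapter for a hypothetical provider that returns nested dicts."""

    def _raw_call(self, prompt):
        n = len(prompt.split())   # stand-in for real token counting
        return {"choices": [{"text": prompt.upper()}],
                "usage": {"in": n, "out": n}}

    def complete(self, prompt):
        raw = self._raw_call(prompt)
        return Completion(raw["choices"][0]["text"],
                          raw["usage"]["in"], raw["usage"]["out"])


class VendorBAdapter:
    """Adapter for a hypothetical provider with a flat response shape."""

    def _raw_call(self, prompt):
        n = len(prompt.split())
        return {"output": prompt[::-1], "tokens": [n, n]}

    def complete(self, prompt):
        raw = self._raw_call(prompt)
        return Completion(raw["output"], raw["tokens"][0], raw["tokens"][1])


def run(provider: Provider, prompt: str) -> Completion:
    # Application code depends only on the shared interface.
    return provider.complete(prompt)


for provider in (VendorAAdapter(), VendorBAdapter()):
    print(run(provider, "two words"))
```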

This work is unglamorous. It is also the thing that lets a stack swap providers in an afternoon instead of a quarter.

Open-weight integration — the underrated half

A large share of recent capability gains has come from open-weight models — Llama, DeepSeek, Qwen, Mistral, India-built models — that can be deployed inside the customer's perimeter. Productionising them is meaningfully different from calling a hosted API: inference runtime, GPU scheduling, model lifecycle, fine-tuning infrastructure, and observability all become first-class concerns.

Multi-model orchestration that includes open-weight models is therefore not just a routing problem; it is a platform problem.

Cost and latency control

Three patterns recur for keeping unit economics under control. Prompt caching lets context that repeats across requests be served as cached input instead of being reprocessed on every call. Batched inference groups individual calls into batches at a fraction of the cost. Speculative, or cascade, routing tries a small model first and escalates to a larger one only when needed.
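The escalation pattern can be sketched in a few lines, assuming the small model returns a confidence score alongside its answer. That is a simplification: in practice the escalation signal might come from a verifier, a schema check, or log-probabilities. The functions cascade, small_model, and large_model and the 0.8 threshold are hypothetical.

```python
def cascade(prompt, small, large, threshold=0.8):
    """Try the small model first; escalate only when its confidence is low."""
    answer, confidence = small(prompt)
    if confidence >= threshold:
        return answer, "small"
    return large(prompt), "large"


# Stub models standing in for real inference calls.
def small_model(prompt):
    if "great" in prompt:
        return "positive", 0.95
    return "unsure", 0.40


def large_model(prompt):
    return "negative"


print(cascade("this product is great", small_model, large_model))
print(cascade("well, it arrived", small_model, large_model))
```

The threshold is the main tuning knob: set too low, weak answers slip through; set too high, most traffic pays for both models.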

A discipline of measuring cost-per-decision (not cost-per-call) and latency-at-percentile (not average latency) is what keeps these patterns honest.
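To make the distinction concrete, the sketch below computes cost per decision by grouping calls under a decision identifier, and computes percentile latency with the nearest-rank method. The helper names and the sample figures are invented for illustration only.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def cost_per_decision(calls):
    """calls is a list of (decision_id, cost_usd); one decision may span many calls."""
    totals = {}
    for decision_id, cost in calls:
        totals[decision_id] = totals.get(decision_id, 0.0) + cost
    return sum(totals.values()) / len(totals)


# Invented sample: decision d1 escalated, so it cost two calls.
calls = [("d1", 0.001), ("d1", 0.05), ("d2", 0.001), ("d3", 0.001)]
latencies_ms = [200, 210, 190, 220, 205, 4000, 215, 195, 208, 202]

cost_per_call = sum(cost for _, cost in calls) / len(calls)
mean_latency = sum(latencies_ms) / len(latencies_ms)

print("cost per call:", cost_per_call)
print("cost per decision:", cost_per_decision(calls))
print("mean latency:", mean_latency)
print("p50 latency:", percentile(latencies_ms, 50))
print("p95 latency:", percentile(latencies_ms, 95))
```

In this sample the single slow request barely registers in the median but dominates the 95th percentile, and the escalated decision makes cost per decision noticeably higher than cost per call.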

Observability for hybrid stacks

The single biggest difference between a prototype hybrid stack and a production one is observability. Per request, the system records the task type, the router's decision, the chosen provider and model, latency and cost, and the outcome. Aggregated, this becomes the dashboard that tells you which slice of your traffic is getting worse, not just whether the average looks fine.
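A per-request record and a per-slice aggregate might look like the following sketch. The field names mirror the list above, but the RequestRecord type, the choice of (task type, model) as the slice key, and the sample records are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    task_type: str
    router_decision: str   # e.g. the name of the rule that fired
    provider: str
    model: str
    latency_ms: float
    cost_usd: float
    success: bool          # the outcome, reduced to a boolean for this sketch


def failure_rate_by_slice(records):
    """Group by (task_type, model) so a bad slice is not hidden by the average."""
    counts = defaultdict(lambda: [0, 0])   # slice -> [total, failed]
    for r in records:
        key = (r.task_type, r.model)
        counts[key][0] += 1
        if not r.success:
            counts[key][1] += 1
    return {key: failed / total for key, (total, failed) in counts.items()}


# Invented sample records.
records = [
    RequestRecord("classify", "rule_known_task", "vendor-a", "small", 210, 0.001, True),
    RequestRecord("classify", "rule_known_task", "vendor-a", "small", 230, 0.001, False),
    RequestRecord("summarise", "classifier", "vendor-b", "frontier", 3900, 0.05, True),
]

for slice_key, rate in failure_rate_by_slice(records).items():
    print(slice_key, rate)
```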

A small architectural rule

Agents and applications should depend on task names, not model names. If your application code says "call gpt-4 with these messages," the orchestration layer is leaking. If it says "draft_summary(input, sensitivity=high)" and the orchestration layer turns that into the right model on the right deployment, you have a system that can absorb the next year of model churn without rewriting itself.
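The rule can be sketched as a task-level function backed by a routing table that only the orchestration layer knows about. The table TASK_ROUTES, the model labels, and the run_model stub are hypothetical; only the draft_summary call shape comes from the text above.

```python
# Routing table owned by the orchestration layer, not by application code.
# Keys are (task name, sensitivity); values are placeholder model labels.
TASK_ROUTES = {
    ("draft_summary", "high"): "open-weight-local",
    ("draft_summary", "low"): "small-hosted",
}


def run_model(model, prompt):
    # Stub standing in for a real inference call.
    return f"[{model}] {prompt}"


def draft_summary(text, sensitivity="low"):
    """Task-level entry point: callers never name a model."""
    model = TASK_ROUTES[("draft_summary", sensitivity)]
    return run_model(model, "Summarise: " + text)


print(draft_summary("Quarterly incident report", sensitivity="high"))
```

Swapping the model behind a task then means editing one table entry, with no change to any caller.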

The unglamorous summary

Multi-model orchestration is not the headline part of an AI project. It is the part that decides whether the headline part survives in production.

Multi-Model Orchestration — How Production AI Stacks Actually Work