The constraint that reshapes the stack
Most AI tutorials begin with an API call. For a large class of real customers — fintech, healthcare, public sector, defence — that's where the tutorial ends.
Many enterprises legally cannot send their data to a third-party inference API. Patient records, banking customer data, classified or sovereign information, and increasingly any data covered by national data-localisation rules — none of it can leave the customer's perimeter without specific, often unattainable, contractual and regulatory cover.
For these customers, AI architecture is shaped by a single first-class question: where can the model run, and where can the data go? Every other choice flows from the answer.
Three deployment patterns
On-prem and VPC-native. The model, the inference runtime, the vector store, the orchestration layer, and the agent loop all live inside the customer's perimeter. Strongest posture, most operationally demanding.
Sovereign cloud. The same inference stack, hosted in a national jurisdiction by a sovereign cloud provider. The customer trades operational control for a managed environment that still satisfies sovereignty constraints.
Hybrid routing. Sensitive data is processed inside the perimeter; non-sensitive data is routed to frontier APIs. A policy layer decides per request. Dominant pattern when frontier capability is genuinely required for some tasks but most data is regulated.
The choice is rarely "what's most secure"; it is "what posture do the regulator and the contract actually require, and what is the customer's operational appetite?" Many large customers run all three patterns at once, for different workloads.
What "in-perimeter" actually means
A genuinely in-perimeter AI deployment satisfies, at minimum: compute on hardware the customer controls; weights stored in the customer's storage; data that does not leave the perimeter, including in error logs and analytics; identity and access integrated into the customer's IdP; controlled network paths; and audit logging that the customer's compliance team can query.
A vendor offering "private deployment" that requires uploading weights to their managed service, sending logs to a third-party observability platform, or pulling embeddings from an external API is not in-perimeter in this sense.
Open-weight models as the foundation
The reason in-perimeter AI is now realistic is the open-weight model wave. Llama, DeepSeek, Qwen, Mistral, and an increasing number of India-built models are capable enough at most enterprise tasks to be the right answer when frontier APIs are off-limits.
Productionising them inside the perimeter takes real engineering: model selection, inference runtime (vLLM, TGI, TensorRT-LLM, SGLang), quantisation, GPU planning, and fine-tuning infrastructure. Customers running this stack should expect their AI estate to look more like an inference platform than an API client.
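As a rough illustration of the GPU planning step, the sketch below estimates the memory needed just to hold a model's weights at different quantisation levels. The function name, the 70-billion-parameter example, and the precision table are illustrative assumptions; the estimate ignores KV cache, activations, and runtime overhead, all of which a real sizing exercise has to add.

```python
# Rough weight-memory estimate for GPU planning (illustrative only).
# Assumption: weight memory ~= parameter count x bytes per parameter.
# KV cache, activations, and runtime overhead are deliberately ignored.

BYTES_PER_PARAM = {
    "fp16": 2.0,  # 16-bit floating point
    "int8": 1.0,  # 8-bit quantisation
    "int4": 0.5,  # 4-bit quantisation
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return approximate weight memory in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    # Hypothetical 70-billion-parameter model.
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(70e9, precision)
        print(f"{precision}: ~{gb:.0f} GB for weights alone")
```

Even this crude arithmetic shows why quantisation and GPU planning are decided together: the same model can need a multi-GPU node at one precision and far less hardware at another.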
Hybrid routing — the practical default
Most regulated customers find hybrid routing to be the practical default. The policy has three pieces: data classification (every input tagged at ingestion), routing rules (each task type matched against the data tag), and egress enforcement (the orchestration layer enforces the routing decision at the call site, not as an honour system).
When the policy is well-designed, the engineering team writes business logic against task names and the routing layer turns each call into the correct model on the correct deployment.
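The three pieces of the policy can be sketched in a few lines. This is a minimal illustration, not a reference design: the task names, the two-level sensitivity scheme, the fail-closed default, and the stubbed model call are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    REGULATED = "regulated"


class Deployment(Enum):
    IN_PERIMETER = "in_perimeter"
    FRONTIER_API = "frontier_api"


@dataclass(frozen=True)
class Request:
    task: str
    payload: str
    tag: Sensitivity  # data classification, assigned at ingestion


# Routing rules: (task name, data tag) -> deployment. Task names are hypothetical.
ROUTES = {
    ("summarise", Sensitivity.PUBLIC): Deployment.IN_PERIMETER,
    ("summarise", Sensitivity.REGULATED): Deployment.IN_PERIMETER,
    ("complex_reasoning", Sensitivity.PUBLIC): Deployment.FRONTIER_API,
}


class EgressDenied(Exception):
    """Raised when a call would send non-public data outside the perimeter."""


def route(request: Request) -> Deployment:
    # Fail closed: anything without an explicit rule stays in-perimeter.
    return ROUTES.get((request.task, request.tag), Deployment.IN_PERIMETER)


def call_model(request: Request, deployment: Deployment) -> str:
    # Egress enforcement at the call site: re-check the tag even if the
    # routing table was misconfigured or bypassed.
    if deployment is Deployment.FRONTIER_API and request.tag is not Sensitivity.PUBLIC:
        raise EgressDenied(f"task {request.task!r} carries {request.tag.value} data")
    # Stub standing in for a real inference call.
    return f"[{deployment.value}] handled {request.task}"


def run_task(task: str, payload: str, tag: Sensitivity) -> str:
    """What business logic calls: a task name, never a model endpoint."""
    request = Request(task, payload, tag)
    return call_model(request, route(request))


if __name__ == "__main__":
    print(run_task("complex_reasoning", "market overview", Sensitivity.PUBLIC))
    print(run_task("complex_reasoning", "patient history", Sensitivity.REGULATED))
```

The check inside call_model duplicates what the routing table already implies, and that is the point: enforcement lives where the request actually leaves, so a bad rule produces an error rather than a leak.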
Audit and traceability
Regulated environments do not accept "the model decided." They accept "the model was given these inputs, retrieved these documents, called these tools, produced this output, and here is the trace."
A working audit trail captures, per agent invocation: the initiator, the input after redaction, the retrieval queries and documents returned, every model call, every tool call, the output and confidence, and any human override. The trace is written by the system as it runs, not reconstructed from logs after the fact, and it lives in queryable storage, not just in append-only log files.
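One way to picture such a trace is as a sequence of structured events written to a queryable store while the agent runs. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the table layout, the event kinds, and the AuditTrace class are assumptions made for the example, not a prescribed schema.

```python
import json
import sqlite3
import time


class AuditTrace:
    """Writes one row per step of an agent invocation, at the time the step happens."""

    def __init__(self, conn: sqlite3.Connection, invocation_id: str, initiator: str):
        self.conn = conn
        self.invocation_id = invocation_id
        self.seq = 0
        conn.execute(
            "CREATE TABLE IF NOT EXISTS trace_events ("
            "invocation_id TEXT, seq INTEGER, ts REAL, kind TEXT, detail TEXT)"
        )
        self.record("initiator", {"initiator": initiator})

    def record(self, kind: str, detail: dict) -> None:
        # Each event is committed immediately, so the trace survives a crash
        # partway through the invocation.
        self.seq += 1
        self.conn.execute(
            "INSERT INTO trace_events VALUES (?, ?, ?, ?, ?)",
            (self.invocation_id, self.seq, time.time(), kind, json.dumps(detail)),
        )
        self.conn.commit()


def events_for(conn: sqlite3.Connection, invocation_id: str) -> list:
    """Return (kind, detail) pairs for one invocation, in the order they occurred."""
    rows = conn.execute(
        "SELECT kind, detail FROM trace_events WHERE invocation_id = ? ORDER BY seq",
        (invocation_id,),
    ).fetchall()
    return [(kind, json.loads(detail)) for kind, detail in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the customer's audit store
    trace = AuditTrace(conn, "inv-001", "analyst@example.internal")
    trace.record("input_redacted", {"text": "Summarise account [REDACTED]"})
    trace.record("retrieval", {"query": "account policy", "doc_ids": ["doc-17"]})
    trace.record("model_call", {"deployment": "in_perimeter", "task": "summarise"})
    trace.record("output", {"text": "Summary...", "confidence": 0.82})
    for kind, detail in events_for(conn, "inv-001"):
        print(kind, detail)
```

Because every step is a row rather than a line of free text, a compliance team can ask questions such as "which invocations retrieved this document?" with an ordinary query.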
Identity, secrets, and key management
In-perimeter AI is not just a model and a GPU. It is a service in the customer's environment, and it must conform to the customer's identity and secrets posture: integration with the customer's IdP, service identities for machine-to-machine calls, secrets stored in the customer's vault, and clear separation of duties.
None of this is research. All of it is the difference between a system the customer's security team approves and one they reluctantly accept under pressure.
The cost question
Frontier-API-only stacks have low fixed cost and high variable cost. In-perimeter stacks have high fixed cost and low variable cost. For many regulated customers, the crossover point, the request volume at which in-perimeter becomes the cheaper option, is closer than they assume.
Hybrid routing also changes the picture: the in-perimeter platform handles the high-volume regulated workload, frontier APIs handle the lower-volume, complex, non-sensitive workload, and total cost can be lower than with either pure approach. The right answer is workload-specific and worth modelling explicitly.
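A first-pass model of the fixed-versus-variable trade-off described above can be very small. The sketch below assumes a simple linear cost model (monthly fixed cost plus a per-request cost); the function names and every figure in the example are hypothetical placeholders, not benchmarks.

```python
from typing import Optional


def monthly_cost(fixed: float, per_request: float, requests: float) -> float:
    """Linear cost model: fixed monthly cost plus a cost per request."""
    return fixed + per_request * requests


def crossover_requests(
    fixed_in: float, var_in: float, fixed_api: float, var_api: float
) -> Optional[float]:
    """Monthly request volume at which in-perimeter and API costs are equal.

    Returns None when the API is not more expensive per request, because the
    two cost lines then never cross in the in-perimeter stack's favour.
    A result of zero or below means in-perimeter is cheaper at any volume.
    """
    if var_api <= var_in:
        return None
    return (fixed_in - fixed_api) / (var_api - var_in)


if __name__ == "__main__":
    # Hypothetical figures, in any single currency.
    fixed_in, var_in = 40_000.0, 0.01   # in-perimeter: hardware, staff, power
    fixed_api, var_api = 0.0, 0.05      # frontier API: pay per request

    n_star = crossover_requests(fixed_in, var_in, fixed_api, var_api)
    print(f"break-even at about {n_star:,.0f} requests per month")
    for n in (250_000, 2_000_000):
        print(
            n,
            "in-perimeter:", monthly_cost(fixed_in, var_in, n),
            "api:", monthly_cost(fixed_api, var_api, n),
        )
```

A hybrid deployment can be modelled with the same functions by splitting the request volume between the two cost lines according to the routing policy and summing the results.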
The shape of a serious sovereign deployment
A serious sovereign or in-perimeter AI deployment looks roughly like this: an inference platform running open-weight models on the customer's hardware or sovereign cloud; a vector and structured retrieval layer inside the perimeter; an orchestration layer that classifies data, routes by sensitivity, and enforces egress rules; a small set of agents with deterministic fallbacks, structured outputs, and explicit handover surfaces; an audit pipeline; identity and secrets aligned with the customer's existing posture; and observability that makes the unit economics legible.
Each is an engineering investment. None is exotic. The vendors that take them seriously are the ones whose AI deployments are still running quietly two years later.