On-Premise RAG: Deployment Guide for Regulated Sectors

Most enterprise AI deployments are SaaS by default. For a meaningful subset of regulated work — privileged legal matters, classified government content, certain healthcare and financial workloads — SaaS is not allowed, not preferred, or not survivable in audit. For those contexts, on-premise RAG is the right architecture, and it is more achievable in 2026 than it was two years ago.

This guide walks through when on-prem is required (not just preferred), what the reference architecture looks like inside the perimeter, how the major compliance regimes map onto the deployment, and what a realistic three-year TCO looks like compared to SaaS. The AAAi Chat Book runs in a deployment model the American Arbitration Association can govern directly — not a SaaS dependency tucked behind another vendor’s pricing page — and that requirement shaped most of the decisions in this guide.

When on-prem RAG is required, not optional

Four trigger scenarios force on-prem, regardless of preference. If any apply, SaaS is off the table from the first conversation.

Privileged or confidential legal content. Attorney-client privilege protections in most common-law jurisdictions are sensitive to the chain of custody of the underlying material. Routing privileged content through a third-party SaaS — even one with strong security posture — creates a non-trivial waiver risk that prudent counsel will not accept. Some firms run on a private cloud as the middle option; others require fully on-prem with no external network egress for the sensitive subset of matters.

Classified or export-controlled material. Government and defence work involving classified content, ITAR-controlled technical data, or equivalent controlled information categories generally cannot be processed in standard commercial SaaS. The deployment runs inside the agency’s or contractor’s secure environment, often air-gapped.

Strict data residency regimes. GDPR allows international data transfers in some circumstances, but specific national supervisory authorities (BfDI in Germany, CNIL in France) and certain sector regulators (financial services in Switzerland, healthcare in the Nordics) have been increasingly explicit that data must remain within national or EU boundaries — and that “stays in the EU but routes through US-headquartered cloud control planes” does not satisfy them. For those regimes, on-prem in a local data centre is the cleanest answer.

Internal policy that pre-dates the AI question. Many firms and publishers have data-handling policies that pre-date generative AI and prohibit sending client content to any third-party system. Updating policy to permit SaaS AI is itself a multi-month project. On-prem deployment sidesteps that policy work because nothing leaves the perimeter.

Outside these four scenarios, SaaS or private cloud is usually the better trade-off. On-prem carries real operational cost — see TCO below — and there is no point paying for it unless something forces it.

Reference architecture: retrieval, vector store, inference, audit log inside the perimeter

A well-built on-prem RAG deployment has the same architectural shape as a SaaS one, with two differences: every component runs inside the customer’s network perimeter, and there is usually richer audit logging because compliance demands it.

The components, inside the perimeter:

┌──────────────────────────────────────────────────────────────────┐
│                  Customer network perimeter                      │
│                                                                  │
│   ┌───────────────┐    ┌──────────────┐    ┌──────────────────┐  │
│   │  Document     │───▶│  Embedding   │───▶│  Vector store    │  │
│   │  ingestion    │    │  + chunking  │    │  (per-corpus     │  │
│   │  pipeline     │    │              │    │   thresholds)    │  │
│   └───────────────┘    └──────────────┘    └────────┬─────────┘  │
│                                                     │            │
│   ┌───────────────┐    ┌──────────────┐    ┌────────▼─────────┐  │
│   │  Query +      │───▶│  Retrieval   │◀───│  Reranker        │  │
│   │  user context │    │  (hybrid)    │    │  (cross-encoder) │  │
│   └───────────────┘    └──────┬───────┘    └──────────────────┘  │
│                               │                                  │
│                        ┌──────▼───────┐    ┌──────────────────┐  │
│                        │  Self-eval   │───▶│  LLM (local)     │  │
│                        │  (pre-gen)   │    │  generation      │  │
│                        └──────────────┘    └────────┬─────────┘  │
│                                                     │            │
│   ┌───────────────┐    ┌──────────────┐    ┌────────▼─────────┐  │
│   │  Audit log    │◀───│  Self-eval   │◀───│  Cited answer    │  │
│   │  (immutable)  │    │  (post-gen)  │    │  with sources    │  │
│   └───────────────┘    └──────────────┘    └──────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

The non-obvious parts:

The LLM is local. Either an open-weights model (Llama, Mistral, Qwen, etc.) running on customer-owned GPUs, or a permitted commercial model deployed inside the perimeter under the appropriate licence. No inference calls leave the perimeter.

The vector store is local. Pinecone, hosted Weaviate, hosted Qdrant — all SaaS by default. For on-prem, the vector store runs on customer infrastructure: self-hosted Weaviate, Qdrant, pgvector on Postgres, or equivalent.

Retrieval and reranking are local. Cross-encoder rerankers (typically open-source, e.g., from the BGE or Jina families) run on the same GPU pool as the LLM or a separate one. Hosted rerankers like Cohere’s are usually not an option because they are external network calls.

Self-evaluation is local. Pre-generation and post-generation self-evaluation runs against the same local model pool, or a different local model deliberately chosen to catch the generator’s failure modes.

Audit logging is mandatory and immutable. Every query, every retrieval result with relevance scores, every self-eval verdict, every generated answer with its citations is logged. For regulated work the audit log is often required to be append-only with cryptographic integrity guarantees.

In our deployments, the per-corpus calibration that drives retrieval thresholds is set during onboarding and rarely changes — but when it does, the change is logged with the same rigour as a query. The audit log is what makes the system defensible if an output is ever challenged.
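One way to get an append-only log with cryptographic integrity is a hash chain, where each entry commits to the hash of the entry before it. The sketch below is a minimal in-memory illustration, not the logging implementation used in any particular deployment; the `AuditLog` class, the event type names, and the payload fields are all hypothetical. A production log would also need durable, write-once storage and an externally anchored head hash.

```python
import hashlib
import json
from dataclasses import dataclass, field

GENESIS_HASH = "0" * 64  # assumed sentinel used as the "previous hash" of the first entry


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so equal records hash equally.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


@dataclass
class AuditLog:
    """Hypothetical append-only log; each entry commits to its predecessor."""
    entries: list = field(default_factory=list)

    def append(self, event_type: str, payload: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        record = {"seq": len(self.entries), "type": event_type, "payload": payload}
        entry = {**record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self) -> bool:
        # Recompute every hash from the start; any edited or reordered entry breaks the chain.
        prev_hash = GENESIS_HASH
        for position, entry in enumerate(self.entries):
            record = {"seq": entry["seq"], "type": entry["type"], "payload": entry["payload"]}
            if entry["seq"] != position or entry["prev_hash"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, record):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append("query", {"user": "u-17", "text": "notice period for termination?"})
log.append("retrieval", {"chunk_ids": ["c-204", "c-311"], "scores": [0.82, 0.74]})
log.append("threshold_change", {"corpus": "rules-2024", "old": 0.70, "new": 0.72})
print(log.verify())  # True

log.entries[1]["payload"]["scores"][0] = 0.99  # simulate after-the-fact tampering
print(log.verify())  # False
```

Note that a threshold change is logged through the same `append` path as a query, which is the property the paragraph above describes.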

Compliance mapping: how the major regimes map onto the deployment

Different compliance regimes pull different levers. The mapping below is a starting point, not legal advice; work it through with counsel for your specific deployment.

Regime	Primary requirement for RAG	What on-prem buys you
GDPR (EU)	Lawful basis for processing, data subject rights, restrictions on international transfers	Data stays in your jurisdiction; transfer issues largely avoided; data subject rights (access, erasure) are easier to operationalise because you control the index
HIPAA (US healthcare)	PHI protection, Business Associate Agreement (BAA) chain	No third-party BAA needed for the AI vendor when the system runs inside your environment and the vendor has no access to PHI, including through support channels; existing infrastructure BAAs cover the deployment
ITAR (US export control)	Restrictions on access to controlled technical data by non-US persons	Inference and retrieval happen inside a US-person-only environment; no risk of inadvertent foreign-person access via vendor support
EU AI Act	Risk classification, transparency, traceability, human oversight	On-prem makes traceability and human oversight operationally clean: full audit log, no external dependencies in the decision chain
FCA / UK financial services	Operational resilience, third-party risk management	On-prem removes a category of third-party risk entirely; resilience requirements apply to your own infrastructure, which you already manage
Local data residency (e.g., German BfDI, Swiss FINMA)	Data must remain in jurisdiction; control plane location matters	On-prem in a local data centre is the cleanest answer; private cloud may suffice if control plane is also local

Two cross-cutting observations. First, the EU AI Act in particular leans hard on traceability: high-risk AI systems are expected to keep records that allow their outputs to be traced and reviewed. On-prem RAG with full audit logging is one of the clearest ways to satisfy this. Second, regulators have increasingly distinguished between “data resident in jurisdiction” and “control plane in jurisdiction.” A US-hosted SaaS control plane managing infrastructure that physically sits in the EU may still not satisfy German or French supervisory authorities.

Deployment patterns: air-gapped vs private-cloud vs VPC

“On-prem” covers a spectrum. Three patterns are common, each with different operational profiles.

Air-gapped. No network connectivity to the outside world. Updates (model weights, software patches, new content) are delivered via approved media transfer processes. Used for classified government work, certain defence contractors, and high-security industrial settings. Operationally the most demanding — every dependency must be vendored, every update is a controlled event.

Private cloud / dedicated tenancy. Customer-managed infrastructure with controlled outbound egress for software updates and approved telemetry, but no customer data leaves. Used for most regulated commercial deployments where air-gap is unnecessary but third-party processor risk is unacceptable. The middle option, and the most common in practice.

Customer VPC. The vendor’s software runs inside the customer’s cloud account (AWS, Azure, GCP, OVH, etc.). Inference, retrieval, and storage all sit in customer-controlled infrastructure. Less rigorous than true on-prem (the cloud provider still has operational access to the underlying hardware) but acceptable for many compliance regimes and substantially cheaper than physical on-prem.

The right pattern follows the trigger. Classified content → air-gapped. Privileged legal matters → private cloud or VPC, or fully on-prem for the most sensitive subset. Strict data residency → VPC in a local region or physical on-prem in a local data centre.

TCO model: on-prem vs SaaS over 3 years (what most RFPs miss)

RFP comparisons routinely under-cost on-prem by treating it as “buy the licence, pay your IT team.” That misses several real cost lines. Conversely, they over-cost SaaS by ignoring data-handling and procurement overheads. The honest comparison has more line items than either side wants to put in the spreadsheet.

The line items that matter over three years:

On-prem costs

Software licence (annual or perpetual + maintenance)
GPU compute (initial purchase or amortised) — non-trivial; an LLM-capable GPU server can run $30-100k+ depending on the model class
GPU lifecycle (refresh every 3-5 years; this is closer than people think for the current generation)
Vector store + retrieval infrastructure (modest)
Storage for the index and audit log (volume-dependent)
Power and cooling (often forgotten; meaningful for GPU loads)
Internal operational time (an FTE-fraction for ops, plus security review, plus content-pipeline maintenance)
Update/upgrade cycles (model upgrades, software upgrades, re-running regression tests against test corpora)

SaaS costs

Per-seat or per-token pricing (often the only line in the RFP)
Vendor security review (recurring; meaningful FTE time on a regulated procurement)
Data Processing Agreement legal review (one-time + per amendment)
Egress costs if integrating with other systems
Audit cooperation (your team’s time when the vendor pulls audit reports)
Contractual risk premium (you are now dependent on the vendor’s continued operation and pricing)

For most regulated deployments at meaningful scale (50+ users, real content volume), the three-year TCO comes out closer than either side initially claims. On-prem is heavier in capex and ops time; SaaS is heavier in per-user and risk-premium pricing. The right answer is rarely “the cheaper one” — it’s “the one that fits the compliance posture, where TCO is within acceptable bounds.”
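The line items above can be put into a small model so that both sides of the comparison are costed on the same basis. This is a sketch under stated assumptions: the class names and fields are hypothetical, GPU capex is amortised straight-line over its useful life, and every figure in the example is a placeholder rather than a quote or a benchmark. The contractual risk premium is left out because it has no agreed unit; it belongs in the narrative next to the number.

```python
from dataclasses import dataclass


@dataclass
class OnPremCosts:
    licence_per_year: float
    gpu_capex: float
    gpu_useful_life_years: float
    infra_per_year: float          # vector store, storage, power and cooling
    ops_fte_fraction: float        # share of one FTE spent on ops and security review
    fte_cost_per_year: float
    upgrades_per_year: float       # model and software upgrade cycles

    def total(self, years: int) -> float:
        # Straight-line amortisation: charge only the share of GPU life used in the window.
        gpu_share = self.gpu_capex * min(years / self.gpu_useful_life_years, 1.0)
        recurring = (
            self.licence_per_year
            + self.infra_per_year
            + self.ops_fte_fraction * self.fte_cost_per_year
            + self.upgrades_per_year
        )
        return gpu_share + recurring * years


@dataclass
class SaasCosts:
    users: int
    price_per_user_per_year: float
    security_review_per_year: float
    dpa_legal_one_time: float
    egress_per_year: float
    audit_cooperation_per_year: float

    def total(self, years: int) -> float:
        recurring = (
            self.users * self.price_per_user_per_year
            + self.security_review_per_year
            + self.egress_per_year
            + self.audit_cooperation_per_year
        )
        return self.dpa_legal_one_time + recurring * years


# Placeholder inputs for illustration only; substitute your own figures.
on_prem = OnPremCosts(
    licence_per_year=60_000, gpu_capex=80_000, gpu_useful_life_years=4,
    infra_per_year=15_000, ops_fte_fraction=0.25, fte_cost_per_year=140_000,
    upgrades_per_year=10_000,
)
saas = SaasCosts(
    users=100, price_per_user_per_year=900, security_review_per_year=15_000,
    dpa_legal_one_time=20_000, egress_per_year=5_000, audit_cooperation_per_year=8_000,
)

for label, model in (("on-prem", on_prem), ("SaaS", saas)):
    print(f"{label}: {model.total(3):,.0f} over 3 years")
```

The useful output is less the totals than the sensitivity: changing the ops FTE fraction or the per-user price by a realistic amount shows quickly which line items the comparison actually turns on.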

A common pattern in our deployments: start with SaaS for evaluation and proof-of-value, then migrate to on-prem or private cloud once the production use case is confirmed and the compliance posture is set. This split usually optimises both speed-to-value and end-state cost.

Vendor evaluation questions for on-prem RAG

A short list of questions that separate vendors who deploy on-prem in practice from vendors whose “on-prem option” is a marketing checkbox:

Can you walk me through your most recent on-prem deployment, with the customer’s identifying details anonymised? (A vendor with real on-prem experience can describe specifics.)
Which open-weights models do you support, and how do you handle model upgrades on-prem?
What is the on-prem deployment install footprint? (CPU/GPU, RAM, disk, network.)
How do you handle vector store sizing as our corpus grows?
What does your audit log contain, and what is its retention model?
How do you support the self-evaluation step on-prem? (Same local model? Different local model? Hosted?)
What is your update cadence, and how do model/software updates get into an air-gapped environment?
What is your support model when something breaks? (Remote shell into our environment? Read-only telemetry? On-site?)
What does your contract say about your access to our data, content, or audit logs? (For real on-prem this should be: none.)
What does the compliance review look like? Have you been through a Big Four audit on a regulated client engagement?

Vendors who flinch at any of these are not ready for serious on-prem work.
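For the vector store sizing question, it helps to arrive with a back-of-envelope estimate of your own. The sketch below assumes float32 embeddings and a flat multiplier for index structures and metadata; the function name and the 1.5 overhead factor are assumptions for illustration, and real overhead depends on the store and index type, so ask the vendor for their measured figure.

```python
def estimate_vector_store_bytes(
    num_documents: int,
    chunks_per_document: int,
    embedding_dim: int,
    bytes_per_value: int = 4,      # float32
    index_overhead: float = 1.5,   # assumed multiplier for index structures and metadata
) -> int:
    """Rough storage estimate: one vector per chunk, plus a flat overhead factor."""
    num_vectors = num_documents * chunks_per_document
    raw_bytes = num_vectors * embedding_dim * bytes_per_value
    return int(raw_bytes * index_overhead)


# Hypothetical corpus: 200,000 documents, about 40 chunks each, 1024-dimensional embeddings.
size = estimate_vector_store_bytes(200_000, 40, 1024)
print(f"{size / 1e9:.1f} GB")
```

Re-running the estimate at two or three times the current document count gives a concrete growth scenario to put in front of the vendor.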

How Edtek deploys

Edtek Chat ships in the same internal shape across SaaS, private cloud, customer VPC, and fully on-prem. What changes on-prem is not the technique inventory; it is what each technique looks like when it has to run inside the customer’s perimeter. That second point is what the rest of this section covers.

The reranker, when Cohere is not an option. Most SaaS-grade RAG defaults to a hosted reranker — Cohere is the common choice — because it is fast and accurate. On-prem deployments cannot reach a hosted endpoint without breaking the perimeter. The on-prem equivalent is an open-source cross-encoder running on the customer’s own GPUs: slower per query than the hosted version, more expensive in compute, but inside the environment. For most regulated workloads the latency difference is well inside the budget; the cost difference shows up in capacity planning, not in user experience.

Per-corpus threshold calibration on bespoke corpora. A SaaS vendor typically calibrates once against averaged customer data and ships a default that is rarely right for any specific client. On-prem deployments give us the opposite: every deployment is bespoke and so is the calibration. Thresholds are set against the actual corpus the client is putting in front of the system, not a synthetic benchmark.
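To make the idea of per-corpus calibration concrete, here is one simple way it could be done: take a labelled sample of retrieval results from the client's own corpus and pick the lowest score cutoff that still meets a target precision. This is an illustrative sketch, not a description of the calibration procedure used in these deployments; the function name, the target, and the sample data are hypothetical.

```python
def calibrate_threshold(scored, target_precision):
    """Return the lowest score cutoff whose precision meets the target, or None.

    `scored` is a list of (relevance_score, is_relevant) pairs labelled on the
    client's own corpus.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    relevant_so_far = 0
    for kept, (score, is_relevant) in enumerate(ranked, start=1):
        relevant_so_far += int(is_relevant)
        # Evaluate only at the end of a run of tied scores, so a cutoff at
        # `score` really does keep exactly the first `kept` results.
        last_of_tie = kept == len(ranked) or ranked[kept][0] != score
        if last_of_tie and relevant_so_far / kept >= target_precision:
            best = score
    return best


# Hypothetical labelled samples from two different corpora.
legal_rules = [(0.91, True), (0.88, True), (0.84, True), (0.79, False),
               (0.77, True), (0.70, False), (0.62, False)]
reference_text = [(0.75, True), (0.71, True), (0.66, False),
                  (0.60, True), (0.55, True), (0.41, False)]

print(calibrate_threshold(legal_rules, 0.8))     # 0.77
print(calibrate_threshold(reference_text, 0.8))  # 0.55
```

The two corpora land on different cutoffs for the same precision target, which is the reason a single shipped default tends to fit no one in particular.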

On-prem self-evaluation, on-prem judge. When self-evaluation runs, it runs against a local judge model — not a hosted API. That changes two things: the cost model (one inference becomes two on the same hardware), and the privacy posture (the judging step never leaves the perimeter either). A deployment that runs the generator on-prem but reaches out to a hosted judge for self-eval has not actually deployed on-prem.
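The control flow around a local judge can be sketched as two gates, one before generation and one after. Everything here is a stand-in: `generate`, `judge_context`, and `judge_answer` are hypothetical callables that would wrap calls to models served inside the perimeter, and the toy implementations exist only so the example runs. The point is the shape: the judge is injected as a local function, so there is no code path that reaches a hosted endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Answer:
    text: Optional[str]
    citations: Sequence[str]
    refused_reason: Optional[str] = None


def answer_with_self_eval(
    question: str,
    chunks: Sequence[dict],
    generate: Callable[[str, Sequence[dict]], str],
    judge_context: Callable[[str, Sequence[dict]], bool],
    judge_answer: Callable[[str, Sequence[dict], str], bool],
) -> Answer:
    # Pre-generation gate: refuse early if retrieval did not surface usable context.
    if not judge_context(question, chunks):
        return Answer(None, [], "insufficient_context")
    draft = generate(question, chunks)
    # Post-generation gate: a second local inference checks the draft against the sources.
    if not judge_answer(question, chunks, draft):
        return Answer(None, [], "unsupported_answer")
    return Answer(draft, [chunk["id"] for chunk in chunks])


# Toy stand-ins for local model calls (assumptions, for illustration only).
def toy_generate(question, chunks):
    return chunks[0]["text"]

def toy_judge_context(question, chunks):
    words = set(question.lower().split())
    return any(words & set(chunk["text"].lower().split()) for chunk in chunks)

def toy_judge_answer(question, chunks, draft):
    return any(draft in chunk["text"] for chunk in chunks)


chunks = [{"id": "rule-12", "text": "the notice period is thirty days"}]
ok = answer_with_self_eval("what is the notice period", chunks,
                           toy_generate, toy_judge_context, toy_judge_answer)
refused = answer_with_self_eval("who chairs tribunals", chunks,
                                toy_generate, toy_judge_context, toy_judge_answer)
print(ok.citations, refused.refused_reason)
```

Passing a different local model behind `judge_answer` than behind `generate` is how the "different judge to catch the generator's failure modes" option would slot in.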

The audit log as customer-controlled storage. In a SaaS deployment the audit log lives on shared infrastructure with rotation policies the customer does not control. On-prem inverts that: the log is the customer’s data on the customer’s storage with the customer’s retention policy. For regulated work this is often the load-bearing reason on-prem exists at all.

Retrieval, the part that does not change much. Two-mode retrieval (lexical and semantic, combined) runs the same on-prem as it does in SaaS — vector stores are mature enough that self-hosted Weaviate, Qdrant, or pgvector cover the needs of all but the largest corpora. The on-prem implication is operational rather than architectural: someone on the customer side runs the database, not us.
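The guide does not say how the lexical and semantic result lists are combined. One widely used option is reciprocal rank fusion (RRF), shown below as an illustration of what "combined" can mean in practice. It works on ranks only, so it needs no score normalisation between the two retrievers and runs identically on any self-hosted store. The function name and the chunk IDs are hypothetical; `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one, best first.

    Each list contributes 1 / (k + rank) for every ID it contains, with rank
    starting at 1. IDs that rank well in more than one list rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


lexical_hits = ["c3", "c1", "c7"]    # e.g. from a keyword index
semantic_hits = ["c1", "c9", "c3"]   # e.g. from the vector store

print(reciprocal_rank_fusion([lexical_hits, semantic_hits]))
# ['c1', 'c3', 'c9', 'c7']
```

The fused list would then go to the cross-encoder reranker, which is the more expensive step and benefits from a short, high-recall candidate set.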

The choice of configuration is the client’s. For some we ship SaaS because that is what the corpus posture allows. For others we ship into the client’s VPC because data-residency rules demand it. For the AAAi Chat Book and similar deployments, the architecture is the same as what we describe in this guide.

One operational detail worth flagging: for clients who want to build richer workflows on top of the RAG core (review queues, approval steps, conditional routing), we integrate n8n as the workflow engine. For clients without that need, the implementation is straight server-side code. The choice depends on whether the client team plans to own and modify workflows themselves, or whether the workflows are stable and lower-level code is fine.

Hallucination-Proof RAG Architecture — the architectural baseline this deployment guide assumes.
RAG vs Fine-Tuning for Regulated Enterprises — when to use which, with the same audience in mind.
Chunking Strategies for Legal & Reference RAG Systems — per-corpus calibration referenced in the architecture section.
Document Automation for In-House Legal Teams — adjacent vendor landscape for on-prem-capable tools.

Frequently asked questions

What is air-gapped RAG?

A RAG deployment with no network connectivity to the outside world. All components — model, vector store, retrieval, reranker, self-evaluation, audit log — run inside the secure perimeter. Updates and content are delivered via approved media transfer. Used for classified government work and certain defence-contractor and high-security industrial settings.
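When updates arrive on physical media, the receiving side usually wants to confirm that what was unpacked is exactly what was approved. A minimal sketch of that check follows, assuming the bundle ships with a manifest mapping relative file paths to SHA-256 digests. The manifest format and function names are hypothetical, and verifying the manifest's own signature is out of scope here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Stream the file so multi-gigabyte weight files do not have to fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_bundle(bundle_dir: Path, manifest: dict) -> list:
    """Return a list of problems; an empty list means the bundle matches the manifest."""
    problems = []
    for relative_path, expected in sorted(manifest.items()):
        target = bundle_dir / relative_path
        if not target.is_file():
            problems.append(f"missing: {relative_path}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {relative_path}")
    # Anything on the media that the manifest does not list is also a finding.
    for found in sorted(bundle_dir.rglob("*")):
        relative = found.relative_to(bundle_dir).as_posix()
        if found.is_file() and relative not in manifest:
            problems.append(f"unlisted: {relative}")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "weights.bin").write_bytes(b"model-weights")
    manifest = {"weights.bin": hashlib.sha256(b"model-weights").hexdigest()}
    print(verify_bundle(bundle, manifest))  # []

    (bundle / "weights.bin").write_bytes(b"tampered")
    print(verify_bundle(bundle, manifest))  # ['checksum mismatch: weights.bin']
```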

When is on-premise RAG actually required versus just preferred?

Required: privileged or confidential legal content, classified or ITAR-controlled material, strict data residency regimes that exclude US-headquartered SaaS control planes, and internal policies that pre-date generative AI and prohibit any third-party processing. Outside these scenarios, SaaS or private cloud is usually a better trade-off.

Does on-prem RAG work with the same architecture as SaaS RAG?

Yes. The components are identical — retrieval, reranker, vector store, generator, self-evaluation, audit log. The differences are operational: every component runs inside the customer perimeter, audit logging is more rigorous because compliance demands it, and the generator is typically an open-weights or licensed local model rather than a hosted API.

What does the three-year TCO look like vs SaaS?

For meaningful-scale deployments (50+ users, real content volume), the three-year TCO comes out closer than either side initially claims. On-prem is heavier on capex (GPU purchase, GPU refresh, internal ops time) and lighter on per-user fees. SaaS is the inverse. The right choice rarely turns on raw cost — it turns on whether compliance posture and risk model fit the deployment.

Can on-prem RAG use the same models as SaaS?

Not always. SaaS deployments typically call hosted commercial models (GPT, Claude, Gemini). On-prem deployments use open-weights models (Llama, Mistral, Qwen, etc.) or commercial models the vendor has licensed for in-perimeter deployment. The architecture is identical; the model choice changes. Quality on open-weights models has caught up substantially in 2025-2026 for most enterprise tasks.

How does on-prem affect model updates?

Updates are a controlled event. New model versions are validated against per-client regression tests before deployment, then transferred into the environment via the same channel used for any controlled software update. For air-gapped environments this is via approved media. Frequency varies: many regulated clients update once or twice a year deliberately, not monthly.
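A per-client regression gate can be as simple as a fixed set of questions with the sources a correct answer must cite, plus a minimum pass rate the candidate model has to reach before the update is approved. The sketch below assumes that shape; the case format, function names, and the 0.95 bar are hypothetical, and real suites would usually check answer content as well as citations.

```python
from typing import Callable, Iterable, Sequence


def regression_pass_rate(
    cases: Iterable[dict],
    cited_sources: Callable[[str], Sequence[str]],
) -> float:
    """Share of cases where the candidate cites every required source."""
    cases = list(cases)
    if not cases:
        raise ValueError("regression suite is empty")
    passed = sum(
        1 for case in cases
        if set(case["must_cite"]) <= set(cited_sources(case["question"]))
    )
    return passed / len(cases)


def approve_update(cases, cited_sources, min_pass_rate: float = 0.95) -> bool:
    return regression_pass_rate(cases, cited_sources) >= min_pass_rate


# Hypothetical suite and a stand-in for "run the candidate model, return its citations".
suite = [
    {"question": "q1", "must_cite": ["rule-12"]},
    {"question": "q2", "must_cite": ["rule-30", "rule-31"]},
]
candidate_outputs = {"q1": ["rule-12", "rule-14"], "q2": ["rule-30"]}

rate = regression_pass_rate(suite, candidate_outputs.__getitem__)
print(rate, approve_update(suite, candidate_outputs.__getitem__))
```

Because the gate is deterministic and runs entirely on local outputs, its result can be written to the audit log alongside the update itself.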

What about the EU AI Act for on-prem deployments?

On-prem deployments make several AI Act requirements operationally cleaner. Traceability is straightforward because the full audit log is under the deploying organisation’s control. Human oversight is operationally simpler because the entire decision pipeline runs in-house. The risk-classification and documentation requirements still apply — they apply regardless of deployment model — but the evidence base is easier to assemble.

How do you handle the LLM self-evaluation step on-prem?

Same way as the generation step: a local model runs the self-evaluation. Often a different model from the generator, deliberately, to catch the generator’s blind spots. The choice between always-on, on-edge-cases, or off entirely is a per-client policy decision, governed by the latency budget and the cost tolerance of running an extra inference per query.