AI
RAG in production: from €10,000 to €900 per day.
A European bank, a RAG pipeline, a hybrid strategy. How we cut inference costs by a factor of ten.
A European bank, 50,000 employees, an internal RAG document search assistant. First version shipped by a competing consultancy: 10,200 euros per day of inference cost, 87% uptime, p95 latency at 14 seconds. Unacceptable. We took over the project. Six weeks later: 900 euros per day, 99.7% uptime, p95 at 1.8 seconds. Here's how.
The diagnosis: everything goes through GPT-4
The initial architecture sent every user query, raw, to GPT-4 to: (1) rephrase it, (2) route to the right documents, (3) re-rank results, (4) generate the answer. Four premium LLM calls per request, three of them useless.
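To see why four premium calls per request hurt so much, here is a back-of-the-envelope cost model. The token counts and per-token prices below are illustrative assumptions for the sketch, not the bank's actual billing data; the point is the structure, not the exact figures.

```python
# Illustrative cost model: four premium LLM calls per request vs one.
# Prices and token counts are assumptions, not real billing data.

PRICE_PER_1K_INPUT = 0.01   # assumed premium-model input price, $/1K tokens
PRICE_PER_1K_OUTPUT = 0.03  # assumed output price, $/1K tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call under the assumed prices."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Original pipeline: all four steps on the premium model. The routing and
# re-ranking calls carry large document contexts, so they dominate the bill.
original = (
    call_cost(300, 100)     # (1) rephrase the query
    + call_cost(2000, 50)   # (2) route to the right documents
    + call_cost(8000, 100)  # (3) re-rank candidate chunks
    + call_cost(4000, 600)  # (4) generate the final answer
)

# Hybrid pipeline: only the final generation hits the premium model;
# steps 1-3 run on local models at near-zero marginal cost.
hybrid = call_cost(4000, 600)

print(f"original: ${original:.3f}/request, hybrid: ${hybrid:.3f}/request")
```

Even before counting retries and failed requests, moving the large-context routing and re-ranking calls off the premium model removes the bulk of the per-request spend.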
The hybrid architecture
Our principle: a small local model for everything that can be done locally, an external LLM only for the last mile. Concretely, on this project:
- Query rephrasing and expansion: fine-tuned Mistral 7B, hosted on internal A10 GPU. Marginal cost zero.
- Embedding and vector search: bge-m3, managed Pinecone database, ~50 ms per request.
- Re-ranking: local MiniLM-L6 cross-encoder, top-50 to top-5 in 80 ms.
- Final generation: Anthropic Claude Sonnet 4 on the 5 most relevant documents only.
The full pipeline
from anthropic import AsyncAnthropic
from local_models import MistralReformulator, BgeEmbedder, MiniLMReranker
from pinecone import Index

async def rag_query(user_query: str, tenant_id: str) -> str:
    # Steps 1-3: 100% local, ~250 ms
    reformulated = await MistralReformulator.expand(user_query)
    embeddings = await BgeEmbedder.encode(reformulated)
    candidates = await Index('bank-docs').query(
        embeddings, top_k=50, filter={'tenant': tenant_id}
    )
    top_5 = await MiniLMReranker.rank(user_query, candidates, k=5)
    # Step 4: external LLM on the last mile only
    response = await AsyncAnthropic().messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=800,
        messages=[{'role': 'user', 'content': build_prompt(user_query, top_5)}],
    )
    return response.content[0].text
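The pipeline calls a `build_prompt` helper it does not define. A minimal sketch of what it might look like; the prompt wording and the `title`/`text` fields on each retrieved chunk are assumptions, to be adapted to your retriever's actual payload:

```python
def build_prompt(user_query: str, top_docs: list[dict]) -> str:
    """Assemble the final prompt from the query and the top-ranked chunks.

    Assumes each doc carries 'title' and 'text' fields; adjust to match
    what your vector store actually returns.
    """
    context = "\n\n".join(
        f"[{i + 1}] {doc['title']}\n{doc['text']}"
        for i, doc in enumerate(top_docs)
    )
    return (
        "Answer the question using only the documents below. "
        "Cite sources as [n]. If the answer is not in the documents, say so.\n\n"
        f"Documents:\n{context}\n\nQuestion: {user_query}"
    )
```

Keeping the instruction to cite sources as `[n]` makes the generated answers auditable against the five retrieved chunks.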
GDPR native, not as a patch
Local models run in the bank's VPC in Frankfurt. No client data reaches the Anthropic API without first passing through a PII classifier (fine-tuned Microsoft Presidio). Prompts sent to Claude are logged, hashed, audited. The bank can produce a full audit trail in less than four hours.
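The production gate runs on the fine-tuned Presidio classifier; the sketch below stands in two regex detectors (hypothetical patterns, far less robust than the real classifier) purely to show the gating logic: detect, redact with typed placeholders, and only then let text leave the VPC.

```python
import re

# Illustrative stand-in for the Presidio-based classifier: the real system
# uses fine-tuned ML detectors, not these two regexes.
PII_PATTERNS = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]{4}){2,7}\s?[A-Z0-9]{1,4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholders.

    Only the redacted text is ever forwarded to the external LLM API;
    the raw text never leaves the VPC.
    """
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text
```

Typed placeholders (rather than blank deletion) keep the redacted prompt readable for the model while leaving an audit-friendly trace of what was removed.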
Observability: we measure everything
Datadog LLM Observability on every span: latency per step, cost per request and per tenant, fallback rate, quality (measured by automatic eval on a golden dataset of 800 weekly questions). Alerts fire if cost per request drifts by more than 15% over 24 hours.
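The 15% drift alert reduces to comparing the trailing 24-hour mean cost per request against a baseline mean. A minimal sketch of that check; the threshold mirrors the article, while the function names and data shapes are assumptions (in production this runs as a Datadog monitor, not application code):

```python
from statistics import mean

DRIFT_THRESHOLD = 0.15  # alert if cost/request drifts more than 15%

def cost_drift(baseline_costs: list[float], last_24h_costs: list[float]) -> float:
    """Relative drift of recent per-request cost vs the baseline mean."""
    baseline = mean(baseline_costs)
    return (mean(last_24h_costs) - baseline) / baseline

def should_alert(baseline_costs: list[float], last_24h_costs: list[float]) -> bool:
    """Fire on drift in either direction: a sudden drop can also signal a bug
    (e.g. the re-ranker silently returning empty context)."""
    return abs(cost_drift(baseline_costs, last_24h_costs)) > DRIFT_THRESHOLD
```

Alerting on drops as well as spikes matters in a RAG pipeline: a cheaper-than-usual request often means a stage upstream stopped returning documents.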
The final numbers
- Inference cost: 10,200 euros/day → 900 euros/day (-91%).
- Uptime: 87% → 99.7%.
- p95 latency: 14 seconds → 1.8 seconds.
- Internal user NPS: 11 → 68.
- GDPR + DORA compliance: audit passed in November 2025.
"The enterprise RAG trap is believing GPT-4 does everything. Real design is deciding what GPT-4 must absolutely not do."
RAG in production is not a model problem, it's an architecture problem. If your RAG PoC costs an arm and a leg and holds up poorly under load, hybrid architecture usually solves both at once. Our teams know how to deploy it.