Tier-1 bank · Paris
European bank: hybrid RAG, inference cost cut tenfold.
RAG PoC at €10,000/month. Hybrid local+cloud architecture, cache, reranking. €900/month at equal quality.
KPI
÷10
monthly inference cost
Duration
7 months
Team
4 engineers
Hub(s)
Paris
10,000 EUR per month for an internal RAG used by 40 people is 250 EUR per user. At that price, you might as well hire a junior.
The context
European tier-1 bank, Paris hub, 12-person AI innovation team. Internal RAG POC for regulatory assistance and document search (corpus of 280,000 documents). Everything on GPT-4 via Azure OpenAI, no routing, no cache, no systematic evaluation.
The problem
- Monthly cost: 10,200 EUR for 40 active users
- p95 latency: 4.1s, perceived as slow
- No continuous evaluation mechanism, quality estimated subjectively
- Naive retrieval (pure cosine), lots of useless context injected
- IT-Compliance committee blocks rollout until cost is under control
The approach
Hybrid architecture via routing: a complexity classifier sends 78% of queries (factual questions) to a locally served, fine-tuned Mistral 7B, and the remaining 22% of complex queries (multi-document reasoning, regulatory comparisons) to GPT-4.
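The routing logic above can be sketched as a small pure function. All names and thresholds here are assumptions for illustration; the project's actual classifier and cutoffs are not public.

```python
# Sketch of confidence-threshold routing (hypothetical names and values).
# A lightweight classifier scores each query; low-complexity queries go to
# the local Mistral 7B, the rest (or low-confidence calls) fall back to GPT-4.

COMPLEXITY_THRESHOLD = 0.6   # assumed value, tuned on the eval benchmark
CONFIDENCE_THRESHOLD = 0.75  # below this, route to GPT-4 regardless

def classify_complexity(query: str) -> tuple[float, float]:
    """Stand-in for the real classifier: returns (complexity, confidence).
    Here: a crude heuristic on query length and comparison keywords."""
    keywords = ("compare", "versus", "across", "reconcile")
    complexity = min(1.0, len(query.split()) / 40
                     + 0.5 * any(k in query.lower() for k in keywords))
    confidence = 0.9 if len(query.split()) > 3 else 0.5
    return complexity, confidence

def route(query: str) -> str:
    complexity, confidence = classify_complexity(query)
    if confidence < CONFIDENCE_THRESHOLD:
        return "gpt-4"          # automatic fallback on low confidence
    return "gpt-4" if complexity > COMPLEXITY_THRESHOLD else "mistral-7b"
```

The key design point is the asymmetric fallback: when the classifier is unsure, the query goes to the expensive model, so routing errors cost money rather than quality.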
The optimizations
- Fine-tuning Mistral 7B on 12,000 Q/A pairs from the corpus
- Embedding cache (Redis) with adaptive TTL
- Hybrid retrieval: BM25 + dense, cross-encoder reranking
- Continuous evaluation: LangSmith + internal benchmark of 800 questions
- Routing with automatic fallback if confidence < threshold
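For the hybrid BM25 + dense retrieval step, reciprocal rank fusion (RRF) is one common way to merge the two rankings before the cross-encoder rerank; whether the project used RRF or a weighted-score scheme is an assumption here.

```python
# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# Each document's score is the sum of 1/(k + rank) over the rankings it
# appears in; k=60 is the standard RRF constant. The fused list would then
# be re-scored by a cross-encoder (not shown).

def rrf_fuse(bm25_ranking: list[str], dense_ranking: list[str],
             k: int = 60) -> list[str]:
    """Merge two ranked doc-id lists into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it needs no calibration between BM25 and cosine scales, which keeps the pipeline simple.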
The stack
- Fine-tuned Mistral 7B (LoRA), served via vLLM on g5.xlarge
- GPT-4 Turbo via Azure OpenAI for complex cases
- Qdrant for vector store (HNSW), Redis for embedding cache
- LangSmith for traces and evaluation
- FastAPI for orchestration
The results
- Monthly cost: 10,200 EUR to 900 EUR (-91%)
- p95 latency: 4.1s to 1.2s
- Quality (internal benchmark): 87% to 91% (better than before)
- Active users: 40 to 320 in 4 months post-rollout
- Compliance committee: green light for rollout
"Abbeal spared us the classic banking mistake: believing that because it's expensive, it must be good. Their systematic measurement approach cleared three years of internal blockers."
What we learned
LoRA fine-tuning works very well at this volume (12k pairs); there is no need to aim for 100k. vLLM holds the load but needs generous GPU sizing to keep p99 latency in check. Our mistake: we underestimated the work of building the evaluation benchmark (six weeks instead of two). If we did it again, we would start with the benchmark: it is the compass of the whole project.
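The benchmark-first lesson above comes down to a very small loop: score the pipeline against the reference question set on every change. This is a toy sketch; the real scoring used LangSmith evaluators rather than exact match, and the function name is hypothetical.

```python
# Toy sketch of the internal benchmark loop: run each reference question
# through the pipeline and report the fraction of matching answers.
# Exact-match scoring is a stand-in for the real LangSmith evaluators.

def benchmark_accuracy(pipeline, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of questions whose answer matches the reference answer."""
    hits = sum(pipeline(q).strip().lower() == a.strip().lower()
               for q, a in qa_pairs)
    return hits / len(qa_pairs)
```

Running this on every routing or retrieval change is what turns "quality estimated subjectively" into a number the compliance committee can sign off on.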
// Read next
Urban mobility · Paris + Montréal
Mobility scale-up: −30% cloud bill, same SLOs.
AWS bill doubled in 18 months without matching traffic growth. GreenOps audit, refactor, Karpenter, ARM64. Measured outcome.
−30%
cloud bill
Sports e-commerce · Paris
Sports leader: PWA, +18% mobile conversion, Lighthouse 92.
Mobile Lighthouse at 38, conversion falling. Next.js App Router, edge, images, splitting. Delivered in 6 months.
+18%
mobile conversion
Industrial robotics · Tokyo
Japanese industrial group: 80 AGVs, ROS 2, +40% warehouse throughput.
Slow fleet, collisions, downtime. Nav2 refactor, perception fusion, multi-agent planning. Zero collisions in 6 months.
+40%
warehouse throughput
