Tier-1 bank · Paris
European bank: hybrid RAG, inference cost cut tenfold.
RAG PoC at €10,000/month. Hybrid local+cloud architecture, cache, reranking. €900/month at equal quality.
KPI
÷10
monthly inference cost
Duration
7 months
Team
4 engineers
Hub(s)
Paris
10,000 EUR per month for an internal RAG used by 40 people is 250 EUR per user. At that price, you might as well hire a junior.
The context
European tier-1 bank, Paris hub, 12-person AI innovation team. Internal RAG POC for regulatory assistance and document search (corpus of 280,000 documents). Everything on GPT-4 via Azure OpenAI, no routing, no cache, no systematic evaluation.
The problem
- Monthly cost: 10,200 EUR for 40 active users
- p95 latency: 4.1s, perceived as slow
- No continuous evaluation mechanism, quality estimated subjectively
- Naive retrieval (pure cosine), lots of useless context injected
- IT-Compliance committee blocks rollout until cost is under control
The approach
Hybrid architecture via routing: a complexity classifier sends 78% of queries (factual questions) to a locally served, fine-tuned Mistral 7B, and the remaining 22% of complex queries (multi-document reasoning, regulatory comparisons) to GPT-4.
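The routing logic above can be sketched as a small pure function. All names and thresholds here are assumptions for illustration; the project's actual classifier and cutoffs are not public.

```python
# Sketch of confidence-threshold routing (hypothetical names and values).
# A lightweight classifier scores each query; low-complexity queries go to
# the local Mistral 7B, the rest (or low-confidence calls) fall back to GPT-4.

COMPLEXITY_THRESHOLD = 0.6   # assumed value, tuned on the eval benchmark
CONFIDENCE_THRESHOLD = 0.75  # below this, route to GPT-4 regardless

def classify_complexity(query: str) -> tuple[float, float]:
    """Stand-in for the real classifier: returns (complexity, confidence).
    Here: a crude heuristic on query length and comparison keywords."""
    keywords = ("compare", "versus", "across", "reconcile")
    complexity = min(1.0, len(query.split()) / 40
                     + 0.5 * any(k in query.lower() for k in keywords))
    confidence = 0.9 if len(query.split()) > 3 else 0.5
    return complexity, confidence

def route(query: str) -> str:
    complexity, confidence = classify_complexity(query)
    if confidence < CONFIDENCE_THRESHOLD:
        return "gpt-4"          # automatic fallback on low confidence
    return "gpt-4" if complexity > COMPLEXITY_THRESHOLD else "mistral-7b"
```

The key design point is the asymmetric fallback: when the classifier is unsure, the query goes to the expensive model, so routing errors cost money rather than quality.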
The optimizations
- Fine-tuning Mistral 7B on 12,000 Q/A pairs from the corpus
- Embedding cache (Redis) with adaptive TTL
- Hybrid retrieval: BM25 + dense, cross-encoder reranking
- Continuous evaluation: LangSmith + internal benchmark of 800 questions
- Routing with automatic fallback if confidence < threshold
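For the hybrid BM25 + dense retrieval step, reciprocal rank fusion (RRF) is one common way to merge the two rankings before the cross-encoder rerank; whether the project used RRF or a weighted-score scheme is an assumption here.

```python
# Minimal sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# Each document's score is the sum of 1/(k + rank) over the rankings it
# appears in; k=60 is the standard RRF constant. The fused list would then
# be re-scored by a cross-encoder (not shown).

def rrf_fuse(bm25_ranking: list[str], dense_ranking: list[str],
             k: int = 60) -> list[str]:
    """Merge two ranked doc-id lists into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, it needs no calibration between BM25 and cosine scales, which keeps the pipeline simple.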
The stack
- Fine-tuned Mistral 7B (LoRA), served via vLLM on g5.xlarge
- GPT-4 Turbo via Azure OpenAI for complex cases
- Qdrant for vector store (HNSW), Redis for embedding cache
- LangSmith for traces and evaluation
- FastAPI for orchestration
The results
- Monthly cost: 10,200 EUR to 900 EUR (-91%)
- p95 latency: 4.1s to 1.2s
- Quality (internal benchmark): 87% to 91% (better than before)
- Active users: 40 to 320 in 4 months post-rollout
- Compliance committee: green light for rollout
"Abbeal spared us the classic banking mistake: believing that because it's expensive, it must be good. Their systematic measurement approach cleared three years of internal blockers."
What we learned
LoRA fine-tuning works very well at this volume (12k pairs); there is no need to aim for 100k. vLLM holds the load but needs generous GPU sizing to keep p99 latency in check. Our mistake: we underestimated the work of building the evaluation benchmark (six weeks instead of two). If we did it again, we would start with the benchmark: it is the compass of the whole project.
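The benchmark-first lesson above comes down to a very small loop: score the pipeline against the reference question set on every change. This is a toy sketch; the real scoring used LangSmith evaluators rather than exact match, and the function name is hypothetical.

```python
# Toy sketch of the internal benchmark loop: run each reference question
# through the pipeline and report the fraction of matching answers.
# Exact-match scoring is a stand-in for the real LangSmith evaluators.

def benchmark_accuracy(pipeline, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of questions whose answer matches the reference answer."""
    hits = sum(pipeline(q).strip().lower() == a.strip().lower()
               for q, a in qa_pairs)
    return hits / len(qa_pairs)
```

Running this on every routing or retrieval change is what turns "quality estimated subjectively" into a number the compliance committee can sign off on.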
// Read next
Urban mobility · Paris + Montréal
Mobility scale-up: −30% cloud bill, same SLOs.
AWS bill doubled in 18 months without matching traffic growth. GreenOps audit, refactor, Karpenter, ARM64. Measured outcome.
−30%
cloud bill
Sports e-commerce · Paris
Sports leader: PWA, +18% mobile conversion, Lighthouse 92.
Mobile Lighthouse at 38, conversion falling. Next.js App Router, edge, images, splitting. Delivered in 6 months.
+18%
mobile conversion
Industrial robotics · Tokyo
Japanese industrial group: 80 AGVs, ROS 2, +40% warehouse throughput.
Slow fleet, collisions, downtime. Nav2 refactor, perception fusion, multi-agent planning. Zero collisions in 6 months.
+40%
warehouse throughput
