Abbeal

Tier-1 bank · Paris

European bank: hybrid RAG, inference cost cut tenfold.

RAG PoC at €10,000/month. Hybrid local+cloud architecture, cache, reranking. €900/month at equal quality.

KPI

/10

monthly inference cost

Duration

7 months

Team

4 engineers

Hub(s)

Paris

Mistral 7B · vLLM · Qdrant · LangSmith · AWS Bedrock

10,000 EUR per month for an internal RAG used by 40 people is 250 EUR per user. At that price, you might as well hire a junior.

The context

European tier-1 bank, Paris hub, 12-person AI innovation team. Internal RAG PoC for regulatory assistance and document search (corpus of 280,000 documents). Everything ran on GPT-4 via Azure OpenAI: no routing, no cache, no systematic evaluation.

The problem

  • Monthly cost: 10,200 EUR for 40 active users
  • p95 latency: 4.1s, perceived as slow
  • No continuous evaluation mechanism, quality estimated subjectively
  • Naive retrieval (pure cosine), lots of useless context injected
  • IT-Compliance committee blocks rollout until cost is under control

The approach

Hybrid architecture via routing: a complexity classifier routes each query. A locally fine-tuned Mistral 7B handles 78% of cases (factual questions); GPT-4 handles the 22% of complex ones (multi-document reasoning, regulatory comparisons).
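A minimal sketch of such a routing layer, assuming a keyword-and-length heuristic as the complexity classifier; the marker words, thresholds, and model names (`mistral-7b-local`, `gpt-4`) are illustrative, not the bank's actual classifier:

```python
# Illustrative routing sketch: heuristics and thresholds are assumptions.
from dataclasses import dataclass

COMPLEX_MARKERS = ("compare", "versus", "across", "why")

@dataclass
class Route:
    model: str        # "mistral-7b-local" or "gpt-4"
    confidence: float

def classify(query: str) -> Route:
    """Cheap complexity classifier: comparative or long queries go to GPT-4."""
    q = query.lower()
    hits = sum(marker in q for marker in COMPLEX_MARKERS)
    if hits >= 1 or len(q.split()) > 40:
        return Route("gpt-4", confidence=0.9)
    return Route("mistral-7b-local", confidence=0.8)

def route(query: str, threshold: float = 0.7) -> str:
    r = classify(query)
    # Automatic fallback: a low-confidence local route escalates to GPT-4
    if r.model == "mistral-7b-local" and r.confidence < threshold:
        return "gpt-4"
    return r.model

print(route("What is the LCR reporting deadline?"))        # mistral-7b-local
print(route("Compare Basel III and CRR liquidity rules"))  # gpt-4
```

In production the classifier would typically be a small trained model rather than keywords, but the routing contract (model name plus confidence, with a fallback threshold) stays the same.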

The optimizations

  • Fine-tuning Mistral 7B on 12,000 Q/A pairs from the corpus
  • Embedding cache (Redis) with adaptive TTL
  • Hybrid retrieval: BM25 + dense, cross-encoder reranking
  • Continuous evaluation: LangSmith + internal benchmark of 800 questions
  • Routing with automatic fallback if confidence < threshold
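The hybrid retrieval bullet names BM25 + dense with cross-encoder reranking but not the merge method; a common choice is reciprocal rank fusion (RRF), sketched here with illustrative document ids (the cross-encoder would rerank the fused list afterwards):

```python
# Sketch of BM25 + dense fusion via reciprocal rank fusion (an assumed
# fusion method; the source only says "BM25 + dense").
def rrf_fuse(bm25_ranked, dense_ranked, k=60):
    """Merge two ranked lists of doc ids; higher fused score = better."""
    scores = {}
    for ranked in (bm25_ranked, dense_ranked):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Best fused score first; a cross-encoder reranker consumes this list
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["d1", "d2", "d3"], ["d3", "d1", "d4"])
print(fused[0])  # d1 — first in BM25, second in dense
```

RRF's appeal here is that it needs no score calibration between the lexical and dense retrievers, only their ranks.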

The stack

  • Fine-tuned Mistral 7B (LoRA), served via vLLM on g5.xlarge
  • GPT-4 Turbo via AWS Bedrock for complex cases
  • Qdrant for vector store (HNSW), Redis for embedding cache
  • LangSmith for traces and evaluation
  • FastAPI for orchestration
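The Redis embedding cache with adaptive TTL could look like the following in-memory sketch; the adaptation rule (doubling the TTL on each hit, up to a cap) and all parameter values are assumptions, since the source only states that the TTL adapts:

```python
import time

# In-memory stand-in for the Redis embedding cache. The "double TTL on
# hit" policy is an illustrative assumption, not the bank's actual rule.
class AdaptiveTTLCache:
    def __init__(self, base_ttl=300.0, max_ttl=86400.0, clock=time.monotonic):
        self.base_ttl, self.max_ttl, self.clock = base_ttl, max_ttl, clock
        self._store = {}  # key -> (embedding, expiry, current_ttl)

    def put(self, key, embedding):
        self._store[key] = (embedding, self.clock() + self.base_ttl, self.base_ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        embedding, expiry, ttl = entry
        now = self.clock()
        if now >= expiry:
            del self._store[key]          # lazy eviction of stale entries
            return None
        # Hot keys earn a longer TTL, so frequent queries stay cached
        ttl = min(ttl * 2, self.max_ttl)
        self._store[key] = (embedding, now + ttl, ttl)
        return embedding
```

With Redis itself, the same policy maps to `GET` plus an `EXPIRE` with the recomputed TTL; the injectable clock here just makes the sketch testable.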

The results

  1. Monthly cost: 10,200 EUR to 900 EUR (-91%)
  2. p95 latency: 4.1s to 1.2s
  3. Quality (internal benchmark): 87% to 91% (better than before)
  4. Active users: 40 to 320 in 4 months post-rollout
  5. Compliance committee: green light for rollout

"Abbeal spared us the classic banking mistake: believing that because it's expensive, it must be good. Their systematic measurement approach unlocked three years of internal blockers."
Head of AI · Tier-1 bank

What we learned

LoRA fine-tuning works very well at this volume (12k pairs); no need to aim for 100k. vLLM holds the load but requires generous GPU sizing to keep p99 latency in check. Our mistake: we underestimated the work of building the evaluation benchmark (six weeks instead of two). If we did it again, we would start with the benchmark: it is the compass of the entire project.

A similar case at your place?

Talk to an architect