Tier-1 bank · Paris
European bank: hybrid RAG, inference cost cut tenfold.
A RAG PoC costing €10,000/month. Hybrid local + cloud architecture, caching, reranking. Now €900/month at equal or better quality.
KPI
/10
monthly inference cost
Duration
7 months
Team
4 engineers
Hub(s)
Paris
10,000 EUR per month for an internal RAG used by 40 people is 250 EUR per user per month. At that price, you might as well hire a junior.
The context
European tier-1 bank, Paris hub, 12-person AI innovation team. Internal RAG PoC for regulatory assistance and document search (corpus of 280,000 documents). Everything ran on GPT-4 via Azure OpenAI, with no routing, no cache and no systematic evaluation.
The problem
- Monthly cost: 10,200 EUR for 40 active users
- p95 latency: 4.1s, perceived as slow
- No continuous evaluation mechanism; quality was judged subjectively
- Naive retrieval (cosine similarity only), injecting a lot of irrelevant context
- IT-Compliance committee blocks rollout until cost is under control
The approach
Hybrid architecture via routing: a complexity classifier sends 78% of queries (factual questions) to a locally hosted, fine-tuned Mistral 7B, and the remaining 22% (multi-document reasoning, regulatory comparisons) to GPT-4.
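A minimal sketch of this routing idea, assuming a classifier that returns a label and a confidence score. The names `Route`, `toy_classifier` and `route_query`, the keyword heuristic and the 0.7 threshold are all hypothetical; the case study does not describe the actual classifier or its threshold. Sending low-confidence queries to the larger model is also an assumption about the direction of the fallback.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

LOCAL = "mistral-7b-local"  # hypothetical label for the fine-tuned local model
CLOUD = "gpt-4"             # hypothetical label for the cloud model


@dataclass(frozen=True)
class Route:
    target: str
    reason: str


def toy_classifier(query: str) -> Tuple[str, float]:
    """Stand-in for the complexity classifier: returns (label, confidence).

    A real classifier would be a trained model; this keyword heuristic
    only exists to make the sketch runnable.
    """
    complex_markers = ("compare", "difference between", "versus")
    q = query.lower()
    if any(marker in q for marker in complex_markers):
        return "complex", 0.9
    if len(q.split()) < 4:
        return "factual", 0.5  # too short to judge: low confidence
    return "factual", 0.85


def route_query(
    query: str,
    classify: Callable[[str], Tuple[str, float]] = toy_classifier,
    threshold: float = 0.7,  # assumed value, not from the case study
) -> Route:
    label, confidence = classify(query)
    # Below the threshold, do not trust the label: fall back to the larger model.
    if confidence < threshold:
        return Route(CLOUD, "low confidence, fallback")
    if label == "complex":
        return Route(CLOUD, "complex query")
    return Route(LOCAL, "factual query")
```

Keeping the classifier injectable means the routing policy can be tested independently of whichever model ends up producing the label.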
The optimizations
- Fine-tuning Mistral 7B on 12,000 Q/A pairs from the corpus
- Embedding cache (Redis) with adaptive TTL
- Hybrid retrieval: BM25 + dense, cross-encoder reranking
- Continuous evaluation: LangSmith + internal benchmark of 800 questions
- Routing with automatic fallback if confidence < threshold
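To illustrate the hybrid retrieval item, here is a small sketch that merges a BM25 ranking and a dense ranking with reciprocal rank fusion (RRF). The case study does not say how the two result lists were combined, so RRF is a swapped-in, commonly used technique, and the document IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical lexical results
dense_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical vector-search results
candidates = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(candidates)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

In a pipeline like the one described, the fused candidates would then be passed to the cross-encoder reranker, which is not shown here.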
The stack
- Fine-tuned Mistral 7B (LoRA), served via vLLM on g5.xlarge
- GPT-4 Turbo via Azure OpenAI for complex cases
- Qdrant for vector store (HNSW), Redis for embedding cache
- LangSmith for traces and evaluation
- FastAPI for orchestration
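The embedding cache can be sketched as follows. This is an in-memory stand-in rather than a Redis client, so that it runs without a server, and "adaptive TTL" is interpreted here as "each cache hit doubles the entry's TTL up to a cap". That interpretation, the class name `EmbeddingCache` and the default TTL values are assumptions; the case study does not define the policy.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple


class EmbeddingCache:
    """In-memory stand-in for a Redis-backed embedding cache.

    Assumed policy: every hit doubles the entry's TTL (capped at max_ttl),
    so frequently repeated queries stay cached longer.
    """

    def __init__(
        self,
        embed: Callable[[str], List[float]],
        base_ttl: float = 300.0,
        max_ttl: float = 86_400.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed = embed
        self._base_ttl = base_ttl
        self._max_ttl = max_ttl
        self._clock = clock
        # key -> (vector, expires_at, current_ttl)
        self._store: Dict[str, Tuple[List[float], float, float]] = {}
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivially different queries share a key.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            vector, _, ttl = entry
            ttl = min(ttl * 2, self._max_ttl)  # adaptive part: extend on hit
            self._store[key] = (vector, now + ttl, ttl)
            return vector
        # Miss or expired entry: compute the embedding and store it with the base TTL.
        self.misses += 1
        vector = self._embed(text)
        self._store[key] = (vector, now + self._base_ttl, self._base_ttl)
        return vector
```

With Redis, the same idea would map to storing the vector under the hashed key and resetting the key's expiry on each hit; the injected `clock` is only there to make the expiry logic testable.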
The results
- Monthly cost: 10,200 EUR to 900 EUR (-91%)
- p95 latency: 4.1s to 1.2s
- Quality (internal benchmark): 87% to 91% (+4 points)
- Active users: 40 to 320 in 4 months post-rollout
- Compliance committee: green light for rollout
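The headline ratios can be re-derived from the figures listed above. This is plain arithmetic on the published numbers; the variable names are illustrative.

```python
cost_before_eur = 10_200
cost_after_eur = 900
users_before = 40

reduction_pct = (cost_before_eur - cost_after_eur) / cost_before_eur * 100
cost_factor = cost_before_eur / cost_after_eur
cost_per_user_before = cost_before_eur / users_before
latency_speedup = 4.1 / 1.2  # p95 latency, seconds before / after

print(f"cost reduction: {reduction_pct:.1f}%")            # cost reduction: 91.2%
print(f"cost divided by: {cost_factor:.1f}")              # cost divided by: 11.3
print(f"cost per user before: {cost_per_user_before:.0f} EUR")  # cost per user before: 255 EUR
print(f"p95 speedup: {latency_speedup:.1f}x")             # p95 speedup: 3.4x
```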
« Abbeal spared us the classic banking mistake: believing that because it's expensive, it must be good. Their systematic measurement approach unlocked three years of internal blockers. »
What we learned
LoRA fine-tuning works very well at this volume (12,000 pairs); there is no need to aim for 100,000. vLLM handles the load but needs generous GPU sizing to keep p99 latency in check. Our mistake: we underestimated the work of building the evaluation benchmark (six weeks instead of two). If we did it again, we would start with the benchmark: it is the compass of the entire project.
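A benchmark-first workflow can start from something as small as the harness below: run every benchmark question through the pipeline and report the share of acceptable answers. This sketch does not use LangSmith, and the keyword-based grading in `is_correct` is a placeholder assumption; the case study does not say how the 800 benchmark questions were scored.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    expected_keywords: List[str]  # hypothetical grading criterion


def is_correct(answer: str, item: BenchmarkItem) -> bool:
    """Placeholder grader: the answer must mention every expected keyword."""
    text = answer.lower()
    return all(kw.lower() in text for kw in item.expected_keywords)


def evaluate(pipeline: Callable[[str], str], items: Iterable[BenchmarkItem]) -> float:
    """Return the fraction of benchmark items the pipeline answers correctly."""
    items = list(items)
    if not items:
        raise ValueError("benchmark is empty")
    passed = sum(is_correct(pipeline(item.question), item) for item in items)
    return passed / len(items)
```

Because `pipeline` is just a function from question to answer, the same harness can score the original single-model setup and the routed setup, which is what makes a before/after quality comparison possible.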
