AI
RAG in production: from €10,000 to €900 per day.
A European bank, a RAG pipeline, a hybrid strategy. How we cut inference costs by a factor of ten.
A European bank, 50,000 employees, an internal RAG document search assistant. First version shipped by a competing consultancy: 10,200 euros per day of inference cost, 87% uptime, p95 latency at 14 seconds. Unacceptable. We took over the project. Six weeks later: 900 euros per day, 99.7% uptime, p95 at 1.8 seconds. Here's how.
The diagnosis: everything goes through GPT-4
The initial architecture sent every user query, raw, to GPT-4 to: (1) rephrase it, (2) route to the right documents, (3) re-rank results, (4) generate the answer. Four premium LLM calls per request, three of them useless.
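To see why four premium calls per request hurt so much, here is a back-of-the-envelope cost model. The token counts and per-token prices below are illustrative assumptions for the sketch, not the bank's actual billing data; the point is the structure, not the exact figures.

```python
# Illustrative cost model: four premium LLM calls per request vs one.
# Prices and token counts are assumptions, not real billing data.

PRICE_PER_1K_INPUT = 0.01   # assumed premium-model input price, $/1K tokens
PRICE_PER_1K_OUTPUT = 0.03  # assumed output price, $/1K tokens

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call under the assumed prices."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Original pipeline: all four steps on the premium model. The routing and
# re-ranking calls carry large document contexts, so they dominate the bill.
original = (
    call_cost(300, 100)     # (1) rephrase the query
    + call_cost(2000, 50)   # (2) route to the right documents
    + call_cost(8000, 100)  # (3) re-rank candidate chunks
    + call_cost(4000, 600)  # (4) generate the final answer
)

# Hybrid pipeline: only the final generation hits the premium model;
# steps 1-3 run on local models at near-zero marginal cost.
hybrid = call_cost(4000, 600)

print(f"original: ${original:.3f}/request, hybrid: ${hybrid:.3f}/request")
```

Even before counting retries and failed requests, moving the large-context routing and re-ranking calls off the premium model removes the bulk of the per-request spend.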
The hybrid architecture
Our principle: a small local model for everything that can be done locally, an external LLM only for the last mile. Concretely, on this project:
- Query rephrasing and expansion: fine-tuned Mistral 7B, hosted on internal A10 GPU. Marginal cost zero.
- Embedding and vector search: bge-m3, managed Pinecone database, ~50 ms per request.
- Re-ranking: local MiniLM-L6 cross-encoder, top-50 to top-5 in 80 ms.
- Final generation: Anthropic Claude Sonnet 4 on the 5 most relevant documents only.
The full pipeline
from anthropic import AsyncAnthropic
from local_models import MistralReformulator, BgeEmbedder, MiniLMReranker
from pinecone import Index

async def rag_query(user_query: str, tenant_id: str) -> str:
    # Steps 1-3: 100% local, ~250 ms
    reformulated = await MistralReformulator.expand(user_query)
    embeddings = await BgeEmbedder.encode(reformulated)
    candidates = await Index('bank-docs').query(
        embeddings, top_k=50, filter={'tenant': tenant_id}
    )
    top_5 = await MiniLMReranker.rank(user_query, candidates, k=5)
    # Step 4: external LLM on the last mile only
    response = await AsyncAnthropic().messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=800,
        messages=[{'role': 'user', 'content': build_prompt(user_query, top_5)}],
    )
    return response.content[0].text
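The pipeline calls a `build_prompt` helper it does not define. A minimal sketch of what it might look like; the prompt wording and the `title`/`text` fields on each retrieved chunk are assumptions, to be adapted to your retriever's actual payload:

```python
def build_prompt(user_query: str, top_docs: list[dict]) -> str:
    """Assemble the final prompt from the query and the top-ranked chunks.

    Assumes each doc carries 'title' and 'text' fields; adjust to match
    what your vector store actually returns.
    """
    context = "\n\n".join(
        f"[{i + 1}] {doc['title']}\n{doc['text']}"
        for i, doc in enumerate(top_docs)
    )
    return (
        "Answer the question using only the documents below. "
        "Cite sources as [n]. If the answer is not in the documents, say so.\n\n"
        f"Documents:\n{context}\n\nQuestion: {user_query}"
    )
```

Keeping the instruction to cite sources as `[n]` makes the generated answers auditable against the five retrieved chunks.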
GDPR native, not as a patch
Local models run in the bank's VPC in Frankfurt. No client data reaches the Anthropic API without first passing through a PII classifier (fine-tuned Microsoft Presidio). Prompts sent to Claude are logged, hashed, audited. The bank can produce a full audit trail in less than four hours.
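The production gate runs on the fine-tuned Presidio classifier; the sketch below stands in two regex detectors (hypothetical patterns, far less robust than the real classifier) purely to show the gating logic: detect, redact with typed placeholders, and only then let text leave the VPC.

```python
import re

# Illustrative stand-in for the Presidio-based classifier: the real system
# uses fine-tuned ML detectors, not these two regexes.
PII_PATTERNS = {
    "iban": re.compile(r"\b[A-Z]{2}\d{2}(?:\s?[A-Z0-9]{4}){2,7}\s?[A-Z0-9]{1,4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholders.

    Only the redacted text is ever forwarded to the external LLM API;
    the raw text never leaves the VPC.
    """
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label.upper()}>", text)
    return text
```

Typed placeholders (rather than blank deletion) keep the redacted prompt readable for the model while leaving an audit-friendly trace of what was removed.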
Observability: we measure everything
Datadog LLM Observability on every span: latency per step, cost per request and per tenant, fallback rate, quality (measured by automatic eval on a golden dataset of 800 weekly questions). Alerts fire if cost per request drifts by more than 15% over 24 hours.
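The 15% drift alert reduces to comparing the trailing 24-hour mean cost per request against a baseline mean. A minimal sketch of that check; the threshold mirrors the article, while the function names and data shapes are assumptions (in production this runs as a Datadog monitor, not application code):

```python
from statistics import mean

DRIFT_THRESHOLD = 0.15  # alert if cost/request drifts more than 15%

def cost_drift(baseline_costs: list[float], last_24h_costs: list[float]) -> float:
    """Relative drift of recent per-request cost vs the baseline mean."""
    baseline = mean(baseline_costs)
    return (mean(last_24h_costs) - baseline) / baseline

def should_alert(baseline_costs: list[float], last_24h_costs: list[float]) -> bool:
    """Fire on drift in either direction: a sudden drop can also signal a bug
    (e.g. the re-ranker silently returning empty context)."""
    return abs(cost_drift(baseline_costs, last_24h_costs)) > DRIFT_THRESHOLD
```

Alerting on drops as well as spikes matters in a RAG pipeline: a cheaper-than-usual request often means a stage upstream stopped returning documents.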
The final numbers
- Inference cost: 10,200 euros/day → 900 euros/day (-91%).
- Uptime: 87% → 99.7%.
- p95 latency: 14 seconds → 1.8 seconds.
- Internal user NPS: 11 → 68.
- GDPR + DORA compliance: audit passed in November 2025.
"The enterprise RAG trap is believing GPT-4 does everything. Real design is deciding what GPT-4 must absolutely not do."
RAG in production is not a model problem, it's an architecture problem. If your RAG PoC costs an arm and a leg and holds up poorly under load, hybrid architecture usually solves both at once. Our teams know how to deploy it.