Embeddings: What They Are and How to Use Them in an Enterprise RAG
Vectors, semantic similarity, model selection, chunking, multilingual pitfalls FR/JA. The practical guide to building retrieval that holds up in production.
You built your first RAG, connected an LLM to your internal docs, and… the answers are off the mark. The model hallucinates on topics clearly present in the knowledge base. The problem? 9 times out of 10, it's not the LLM that's failing, it's the retrieval. And at the heart of retrieval lie embeddings.
Embeddings are the invisible backbone of any production RAG. They transform your documents into vectors, enable semantic search, and determine the quality of what you inject into the LLM's context. Misconfigure them, and the entire chain collapses.
This guide covers what we wish we'd known before putting our first RAGs into production: the intuition behind embeddings, how to choose your model, chunk your documents, handle multilingual content (including Japanese, which breaks everything), and above all, how to verify that you're not building on sand.
What is an embedding, concretely?
An embedding is a vector of numbers (typically a few hundred to a few thousand dimensions; the models discussed below range from 384 to 4096) that represents the meaning of a piece of text. Two semantically close sentences end up close in vector space, even if they share no words.
Example: "How do I cancel my contract?" and "Subscription cancellation procedure" have virtually no terms in common, but their embeddings will be neighbors. This is what allows the RAG to retrieve the right document even when the user phrases their question differently.
Technically, an embedding model is a neural network (often a transformer) trained to project text into this vector space. You run it once over all document chunks (indexing), then use it at each user query to encode the question and search for the closest chunks by cosine similarity.
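To make the "closest chunks by cosine similarity" step concrete, here is a minimal sketch in plain Python. The 3-dimensional vectors and chunk names are made up for illustration; a real embedding model would produce the vectors, and a vector DB would replace the brute-force sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    scored = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy vectors, invented for this example (not real model output).
chunks = {
    "cancel_procedure": [0.9, 0.1, 0.0],
    "billing_faq": [0.2, 0.9, 0.1],
    "office_map": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the encoded user question

print(top_k(query, chunks))  # ['cancel_procedure', 'billing_faq']
```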
Choosing your embedding model: dimensions, domain, costs
More dimensions ≠ automatically better
The temptation is strong to take the model with the most dimensions on the MTEB leaderboard. In practice, we always benchmark 2-3 models on real business questions before deciding.
A 1024-dimension model well-adapted to your domain (legal, medical, technical) regularly beats a generic 3072-dimension model. And infrastructure costs aren't neutral:
- Storage: 3× more dimensions = 3× more storage in the vector DB
- RAM: HNSW indexes load everything into memory; on 10M chunks, that matters
- Latency: similarity computation on 3072 dims is slower, especially at scale
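A quick back-of-the-envelope calculation shows how the storage point scales. This sketch assumes float32 vectors (4 bytes per value) and counts only the raw vectors, not the index overhead or metadata, so real footprints will be higher.

```python
def raw_vector_bytes(num_chunks, dims, bytes_per_value=4):
    """Size of the raw vectors alone, assuming float32 values by default."""
    return num_chunks * dims * bytes_per_value

def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9

for dims in (1024, 3072):
    size = to_gb(raw_vector_bytes(10_000_000, dims))
    print(f"{dims} dims on 10M chunks: {size:.1f} GB of raw vectors")
```

On 10M chunks this gives roughly 41 GB at 1024 dimensions versus 123 GB at 3072, which is the difference between one machine and several if the index must stay in memory.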
Models to consider (end of 2024)
A few options we use in production, by use case:
- text-embedding-3-large / text-embedding-3-small (OpenAI): solid, multilingual, managed API. Not open-source, cost per token.
- intfloat/e5-mistral-7b-instruct: open-source, 4096 dim, very good on complex tasks. Heavy to self-host.
- sentence-transformers/all-MiniLM-L6-v2: lightweight (384 dim), fast, sufficient for many simple English use cases.
- multilingual-e5-large: 1024 dim, truly multilingual (including Japanese), good performance/cost compromise.
We favor models with a context window >= 512 tokens to be able to embed substantial chunks without truncating them.
Self-host vs managed API
OpenAI embeddings: simple, no infrastructure to manage, predictable pricing. But your data leaves your infrastructure, and you're exposed to rate limits and downtime.
Self-host (HuggingFace + vLLM / TEI): full control, fixed costs once the infrastructure is paid for, no data leaving your perimeter. But you manage scaling, monitoring, and model versions. At high volume (>100k queries/day), self-hosting becomes cost-effective.
Chunking: splitting without breaking meaning
Chunking is the Achilles' heel of 90% of the RAGs we audit. An embedding only encodes what's in the chunk. If you cut a sentence in two or a concept spans 3 disjoint chunks, retrieval fails.
Chunking strategies
- Fixed-size + overlap: 512-token chunks with 50-token overlap. Simple, generic, often sufficient for technical documentation.
- Semantic chunking: detect topic transitions (consecutive embeddings that diverge) to split. Cleaner, but slower to index.
- Structure-aware: respect markdown boundaries (headers, lists), PDF sections, paragraphs. Preserves hierarchical context.
- Hybrid: fixed-size with fallback to sentence boundaries. Speed/quality compromise.
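The fixed-size + overlap strategy from the list above can be sketched in a few lines. This is an illustration under simple assumptions: the input is already a list of tokens (in practice you would use your embedding model's tokenizer), and the function name is ours.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Synthetic tokens for the example; a real pipeline would tokenize text.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [512, 512, 276]
```

The last 50 tokens of each chunk are repeated at the start of the next one, which is what keeps a boundary-straddling sentence retrievable.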
In practice, we start with fixed-size + overlap, measure recall, and iterate if necessary. On structured corpora (internal wikis, technical docs), structure-aware brings a net gain.
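If iteration leads you toward semantic chunking, the core rule (split where consecutive embeddings diverge) could look like the sketch below. The sentence embeddings are invented 2-dimensional vectors and the 0.5 threshold is an arbitrary assumption; both would come from your model and your own tuning.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; start a new chunk when similarity drops."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])  # topic transition detected
        groups[-1].append(sentences[i])
    return [" ".join(group) for group in groups]

sentences = [
    "Invoices are sent monthly.",
    "Payment is due in 30 days.",
    "Support is open 9 to 5.",
]
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]  # made-up vectors

print(semantic_chunks(sentences, embeddings))
```

Here the two billing sentences stay together and the support sentence starts a new chunk. The indexing cost comes from having to embed every sentence before any chunk exists.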
Overlap and context
Overlap means that a sentence cut at a chunk boundary usually still appears whole in a neighboring chunk, so it remains retrievable. 10-20% of chunk size is a good starting point. Too much overlap inflates the chunk count, and with it storage and latency costs.
We also often add parent context to the chunk (section title, document name) so the embedding captures "this chunk is about billing, not support." This can be a simple prefix: [Billing] <chunk content>.
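As an illustration of combining structure-aware splitting with a context prefix, here is a simplified sketch for markdown input. It assumes that any line starting with "#" is a header (so it would misread "#" lines inside code blocks), and the function name is hypothetical.

```python
def chunk_markdown_with_context(markdown_text):
    """Split on markdown headers and prefix each chunk with its section title."""
    sections = []  # list of (title, lines) pairs
    current_title, current_lines = "Untitled", []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            sections.append((current_title, current_lines))
            current_title, current_lines = line.lstrip("#").strip(), []
        else:
            current_lines.append(line.strip())
    sections.append((current_title, current_lines))

    chunks = []
    for title, lines in sections:
        body = " ".join(text for text in lines if text)
        if body:  # skip sections with no content
            chunks.append(f"[{title}] {body}")
    return chunks

doc = """# Billing
Invoices are issued on the 1st.
Late fees apply after 30 days.

# Support
Tickets are answered within 24h.
"""
for chunk in chunk_markdown_with_context(doc):
    print(chunk)
```

Each chunk is embedded with its prefix, so a query about invoices is pulled toward the "[Billing]" chunks even when the body text alone is ambiguous.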
Vector databases: storage and search
Once embeddings are computed, they need to be stored and queryable by cosine similarity in <50ms. That's the vector DB's job.
Which vector DB?
Options are multiplying; here are the ones we use depending on context:
- pgvector (Postgres): Postgres extension, perfect if you already have Postgres, relational metadata + vectors in the same database. Scales to ~1M vectors without issues.
- Qdrant: open-source, fast, easy self-host (Docker), rich filters, good monitoring. Our default choice for self-hosting.
- Pinecone / Weaviate cloud: managed, auto-scaling, good DX. Cost per vector stored + queries, can get expensive quickly.
- Milvus: very scalable (clusters), overkill for <10M vectors, useful if targeting hundreds of millions.
For 80% of enterprise RAGs (<5M chunks), pgvector or Qdrant are more than sufficient.
HNSW index and filters
The dominant search algorithm is HNSW (Hierarchical Navigable Small World). It builds a multi-layer graph that supports approximate nearest-neighbor search in roughly O(log n) time. The precision/speed trade-off is configurable via ef_construction (at index build time) and ef_search (at query time).
In production, we often combine vector similarity + metadata filters: "retrieval in Finance department docs, FR language, less than a year old." The vector DB must support these filters without killing performance (pre-filtering or post-filtering depending on the DB).
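The difference between pre-filtering and post-filtering is easiest to see on a toy example. The sketch below uses a brute-force search with a dot-product score as a stand-in for a real vector DB; the data, field names, and functions are all invented for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(points, query, k, keep=lambda meta: True):
    """Brute-force top-k by dot product over the points accepted by `keep`."""
    candidates = [p for p in points if keep(p["meta"])]
    candidates.sort(key=lambda p: dot(p["vector"], query), reverse=True)
    return candidates[:k]

def pre_filter_search(points, query, k, keep):
    # Filter first, then rank: returns k hits whenever enough points match.
    return search(points, query, k, keep)

def post_filter_search(points, query, k, keep):
    # Rank first, then filter: may return fewer than k hits.
    return [p for p in search(points, query, k) if keep(p["meta"])]

points = [
    {"id": 1, "vector": [0.9, 0.1], "meta": {"dept": "HR"}},
    {"id": 2, "vector": [0.8, 0.2], "meta": {"dept": "Finance"}},
    {"id": 3, "vector": [0.7, 0.3], "meta": {"dept": "HR"}},
    {"id": 4, "vector": [0.1, 0.9], "meta": {"dept": "Finance"}},
]
query = [1.0, 0.0]

def is_finance(meta):
    return meta["dept"] == "Finance"

print([p["id"] for p in pre_filter_search(points, query, 2, is_finance)])   # [2, 4]
print([p["id"] for p in post_filter_search(points, query, 2, is_finance)])  # [2]
```

Post-filtering loses a result here because an HR document occupied one of the two top slots. With selective filters, that effect is why the filtering strategy of your vector DB is worth checking.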
Multilingual: FR, EN, JA… and the pitfalls of Japanese
A multilingual corpus breaks many implicit assumptions of English-centric RAG pipelines.
Multilingual embedding model
You need a truly multilingual model, trained on your target languages. English-centric models often project French into a degraded subspace, and Japanese… even worse.
We always validate cross-lingual retrieval: a question in French should be able to retrieve an English document if it's the right context. Some models poorly align vector spaces across languages.
Japanese breaks everything
Japanese has no spaces between words, byte-level tokenizers (GPT-type) over-segment it, and sentence-based chunking strategies require a dedicated segmenter (splitting on "." finds almost nothing, since sentences end with 。).
- Tokenizer: use a model with an appropriate tokenizer (multilingual SentencePiece, or integrated Japanese tokenizer).
- Chunking: we use a Japanese morphological analyzer like fugashi or sudachipy to segment the text, so that chunks split on word and sentence boundaries.
- Overlap: more critical in Japanese, as cutting in the middle of a compound word destroys meaning.
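To show the idea of sentence-aware chunking for Japanese without extra dependencies, here is a minimal fallback that splits on Japanese sentence-ending punctuation using only the standard library. It is not a substitute for fugashi or sudachipy: it does not find word boundaries, so it cannot protect compound words inside an over-long sentence. The function names and the character budget are assumptions for the example.

```python
import re

# Sentence-ending punctuation: ideographic full stop, full-width ! and ?.
_SENTENCE = re.compile(r"[^。！？]+[。！？]*")

def split_japanese_sentences(text):
    """Split text after each run of sentence-ending punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

def pack_sentences(sentences, max_chars=30):
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

text = "契約を解約したいです。手続きはどうすればよいですか？サポートに連絡してください。"
sentences = split_japanese_sentences(text)
print(sentences)
print(pack_sentences(sentences))
```

The text is split into three sentences, and chunks are then assembled from whole sentences only, so no chunk ends in the middle of one.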
On a recent project with FR/EN/JA corpus, we had to switch from text-embedding-ada-002 (mediocre JP performance) to multilingual-e5-large + custom chunking per language. Recall@5 on JP questions went from 45% to 78%.
Evaluating retrieval quality
An embedding is never good or bad in absolute terms. We measure the quality of the retrieval it enables. Without an eval, it's impossible to compare two models or two chunking strategies, or to detect a regression after a change.
Retrieval metrics
- Recall@k: for what proportion of questions does an expected chunk appear among the top k results? Recall@5 is our usual default.
- MRR (Mean Reciprocal Rank): the average, over all questions, of 1/rank of the first correct chunk (1.0 means it is always ranked first). Higher is better.
- NDCG: takes result order into account, useful if you have graded relevances (relevant / very relevant).
We prefer Recall@5 + MRR: simple to understand for POs, discriminating enough to decide between technical options.
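For reference, MRR takes only a few lines to compute. This sketch assumes each question comes with the ordered list of retrieved chunk ids and a set of expected ids; the ids are invented.

```python
def reciprocal_rank(retrieved_ids, expected_ids):
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in expected_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """`runs` is a list of (retrieved_ids, expected_ids) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)

runs = [
    (["c7", "c2", "c9"], {"c7"}),  # first hit at rank 1 -> 1.0
    (["c4", "c1", "c3"], {"c1"}),  # first hit at rank 2 -> 0.5
    (["c5", "c6", "c8"], {"c0"}),  # no hit -> 0.0
]
print(mean_reciprocal_rank(runs))  # 0.5
```

Unlike recall@k, MRR rewards putting the right chunk first rather than merely somewhere in the top k, which is why the two metrics complement each other.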
Building the eval set
The eval set is 50-200 pairs (question, expected reference chunks). We build it with domain experts, starting from real user questions (support logs, FAQ, user research).
Minimal format: a CSV with the columns question | doc_id | chunk_id | language. We run each question through retrieval and check whether the expected chunks are in the top-k.
```python
# Simplified recall@k eval example
def evaluate_recall(questions, vector_db, k=5):
    hits = 0
    for q in questions:
        results = vector_db.search(q['text'], top_k=k)
        retrieved_ids = [r['chunk_id'] for r in results]
        if q['expected_chunk_id'] in retrieved_ids:
            hits += 1
    return hits / len(questions)
```
We run this eval in CI on every embedding model change, chunking strategy change, or corpus update. A regression >5 recall points = alert.
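The alert rule can be expressed as a small gate. This is a sketch under assumptions: recall values are fractions between 0 and 1, the baseline is whatever your last accepted run produced, and the function name is hypothetical.

```python
def check_recall_regression(baseline, current, max_drop_points=5.0):
    """Return (ok, drop), where drop is expressed in recall points (0-100 scale)."""
    # Rounding avoids spurious alerts caused by floating-point noise.
    drop = round((baseline - current) * 100, 6)
    return drop <= max_drop_points, drop

ok, drop = check_recall_regression(baseline=0.78, current=0.71)
if not ok:
    print(f"ALERT: recall@5 dropped by {drop:.1f} points")
```

In a CI job, the alert branch would typically exit with a non-zero status so that the pipeline fails visibly.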
Common pitfalls and field feedback
Embedding metadata… or not?
Some add title, author, date into the embedded text. Advantage: the embedding captures "this chunk is about HR." Disadvantage: pollution if metadata is repetitive or non-semantic (technical IDs).
Beyond a short semantic prefix such as a section title, we prefer storing metadata separately and using vector DB filters. The embedding remains purely semantic, and we then filter by department, language, or date.
Reranking: the second pass
Vector retrieval quickly returns ~20-50 candidates. We can then apply a reranker (cross-encoder) that finely scores each (question, chunk) pair and keeps the top-5.
The reranker is slower but more precise. It improves MRR by 10-20% on complex questions. Models: cross-encoder/ms-marco-MiniLM, jina-reranker-v1, or via Cohere Rerank API.
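The two-stage flow can be sketched independently of any particular reranker. In the example below, `toy_overlap_score` is a deliberately naive placeholder (word overlap) standing in for a cross-encoder; it only exists to make the sketch runnable and is not how a real reranker scores pairs.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """Re-score every (question, chunk) pair and keep the best top_n."""
    scored = [(score_fn(question, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]

def toy_overlap_score(question, chunk):
    """Placeholder scorer: fraction of question words found in the chunk."""
    q_words = set(question.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

# Pretend these came back from the vector search stage.
candidates = [
    "office opening hours and access badges",
    "how to cancel a subscription contract",
    "contract renewal pricing",
]
print(rerank("cancel my contract", candidates, toy_overlap_score, top_n=2))
```

Because `score_fn` is called once per candidate, the cost of this stage grows linearly with the number of candidates, which is why it is applied to a few dozen chunks rather than to the whole corpus.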
Corpus drift and embedding refresh
The corpus evolves (new docs, updates). Modified chunks need re-embedding. We track doc_id + content hash and re-index only the delta rather than running a full reindex.
Warning: changing embedding model = re-indexing the entire corpus. Vectors from two models don't live in the same space; you can't mix them.
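Delta re-indexing based on content hashes could look like the sketch below. It assumes you keep a mapping of chunk id to hash from the last indexing run; the id format ("doc1#0") and function names are invented for the example.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_delta(indexed_hashes, current_chunks):
    """Compare stored hashes with the current corpus.

    indexed_hashes: {chunk_id: hash} saved at the last indexing run.
    current_chunks: {chunk_id: text} as of now.
    Returns (ids to embed or re-embed, ids to delete from the index).
    """
    to_embed = [
        chunk_id
        for chunk_id, text in current_chunks.items()
        if indexed_hashes.get(chunk_id) != content_hash(text)
    ]
    to_delete = [cid for cid in indexed_hashes if cid not in current_chunks]
    return to_embed, to_delete

indexed = {
    "doc1#0": content_hash("old pricing"),
    "doc1#1": content_hash("terms"),
    "doc2#0": content_hash("removed page"),
}
current = {"doc1#0": "new pricing", "doc1#1": "terms", "doc3#0": "brand new"}

print(plan_delta(indexed, current))  # (['doc1#0', 'doc3#0'], ['doc2#0'])
```

When the embedding model itself changes, this plan does not apply: every chunk must be re-embedded, whatever its hash.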
Costs and sizing
Some orders of magnitude on a corpus of 1 million chunks (512 tokens/chunk, 1024-dim embedding):
- One-shot indexing: ~4-8h on V100 GPU (self-host) or ~$50-100 in API calls (OpenAI embeddings).
- Vector DB storage: ~4-6 GB in RAM (HNSW index), ~2-3 GB on disk (Qdrant).
- Retrieval latency: <30ms @ p95 for top-20, on standard infrastructure (16 vCPU, 32 GB RAM).
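The indexing figures above can be sanity-checked with simple arithmetic. In this sketch, the price of $0.13 per million tokens is an assumed placeholder, not a quoted rate (check your provider's current pricing), and the throughput is merely what a 4-8h run implies, not a benchmark.

```python
def embedding_api_cost(num_chunks, tokens_per_chunk, usd_per_million_tokens):
    """Return (total tokens, estimated API cost in USD)."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens, total_tokens / 1_000_000 * usd_per_million_tokens

def implied_throughput(num_chunks, hours):
    """Chunks per second needed to index num_chunks in the given time."""
    return num_chunks / (hours * 3600)

# Assumed price: placeholder value for illustration only.
tokens, cost = embedding_api_cost(1_000_000, 512, usd_per_million_tokens=0.13)
print(f"{tokens:,} tokens -> ${cost:.2f}")

for hours in (4, 8):
    rate = implied_throughput(1_000_000, hours)
    print(f"{hours}h run: about {rate:.0f} chunks/s")
```

With that assumed price, 1 million chunks of 512 tokens come to 512 million tokens and about $67, inside the $50-100 range. A 4-8h self-hosted run corresponds to roughly 35-70 chunks per second.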
Scaling to 10M chunks: we move to a Qdrant or Milvus cluster, with sharding. Latency stays <50ms if the index fits in distributed RAM.
In reality, a RAG is 80% retrieval
The final LLM only reformulates what retrieval gives it. If returned chunks are off-topic, even GPT-4 will hallucinate. Solid retrieval (adapted embeddings, clean chunking, continuous eval) is the difference between a POC that impresses in demos and a system that holds up in production under load and edge cases.
At Abbeal, we build RAGs for demanding environments: multilingual, large corpora, tight latency, data compliance. If you're scaling your RAG and hitting walls on retrieval, multilingual chunking pitfalls, or continuous evaluation, [let's talk](https://abbeal.com/contact). We've already made the mistakes for you.
