What is an embedding, in one sentence?

An embedding is a vector of numbers (often 768 to 3072 dimensions) that captures the meaning of a piece of text or media, so that two semantically close items end up close in vector space. It is the building block that makes similarity search in a RAG possible.

Should I always pick the embedding model with the most dimensions?

No. More dimensions can mean finer retrieval, but also more storage, more RAM in the vector DB and slower queries. On our projects we benchmark two or three models on the real business questions before deciding. A well-fitted 1024-dimension model often beats a generic 3072-dimension one.

How do I handle a multilingual EN / FR / JA corpus?

You need a genuinely multilingual embedding model, tested on your actual languages. Japanese in particular breaks many tokenizers and chunking strategies designed for English. We always validate cross-lingual retrieval with a set of questions in each language before going live.

How do I know if my embeddings are good enough?

You never judge an embedding in isolation. You measure the retrieval it enables: recall@k, MRR and a reference question/answer set built with domain experts. Without that eval set you cannot compare two models or detect a regression after a chunking change.


Embeddings: What They Are and How to Use Them in an Enterprise RAG

Vectors, semantic similarity, model selection, chunking, multilingual pitfalls FR/JA. The practical guide to building retrieval that holds up in production.

Published 16 June 2026 | Updated 22 June 2026 | 10 min read

You built your first RAG, connected an LLM to your internal docs, and… the answers are off the mark. The model hallucinates on topics clearly present in the knowledge base. The problem? 9 times out of 10, it's not the LLM that's failing, it's the retrieval. And at the heart of retrieval lie embeddings.

Embeddings are the invisible backbone of any production RAG. They transform your documents into vectors, enable semantic search, and determine the quality of what you inject into the LLM's context. Misconfigure them, and the entire chain collapses.

This guide covers what we wish we'd known before putting our first RAGs into production: the intuition behind embeddings, how to choose your model, chunk your documents, handle multilingual content (including Japanese, which breaks everything), and above all, how to evaluate that you're not building on sand.

What is an embedding, concretely?

An embedding is a vector of numbers (typically 768 to 3072 dimensions) that represents the meaning of a piece of text. Two semantically close sentences end up close in vector space, even if they share no words.

Example: "How do I cancel my contract?" and "Subscription cancellation procedure" have virtually no terms in common, but their embeddings will be neighbors. This is what allows the RAG to retrieve the right document even when the user phrases their question differently.

Technically, an embedding model is a neural network (often a transformer) trained to project text into this vector space. You run it once over all document chunks (indexing), then use it at each user query to encode the question and search for the closest chunks by cosine similarity.
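To make the query step concrete, here is a minimal sketch of a cosine-similarity search in pure Python. The toy 3-dimensional vectors, the in-memory `index` dict and the `top_k` helper are assumptions made for illustration; a real system would get its vectors from an embedding model and delegate the search to a vector DB.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query.

    `index` is a hypothetical in-memory dict {chunk_id: vector}.
    """
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real model output.
index = {
    "cancel-procedure": [0.9, 0.1, 0.0],
    "billing-faq": [0.2, 0.9, 0.1],
    "office-map": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # pretend embedding of "How do I cancel my contract?"
print(top_k(query, index, k=2))
```

This brute-force scan compares the query with every vector, which is fine for a few thousand chunks. Beyond that, an approximate index does the same job faster.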

Choosing your embedding model: dimensions, domain, costs

More dimensions ≠ automatically better

The temptation is strong to take the model with the most dimensions on the MTEB leaderboard. In practice, we always benchmark 2-3 models on real business questions before deciding.

A 1024-dimension model well-adapted to your domain (legal, medical, technical) regularly beats a generic 3072-dimension model. And infrastructure costs aren't neutral:

Storage: 3× more dimensions = 3× more storage in the vector DB
RAM: HNSW indexes load everything into memory; on 10M chunks, that matters
Latency: similarity computation on 3072 dims is slower, especially at scale
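The storage point is easy to check with back-of-the-envelope arithmetic. The sketch below assumes vectors are stored as float32 (4 bytes per value) and counts only the raw vectors; index structures such as the HNSW graph and any metadata come on top. The helper name is ours.

```python
def raw_vector_bytes(num_chunks, dims, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 (4 bytes per value)."""
    return num_chunks * dims * bytes_per_value

GIB = 1024 ** 3
for dims in (1024, 3072):
    size = raw_vector_bytes(10_000_000, dims)
    print(f"{dims} dims: {size / GIB:.1f} GiB")
# 1024 dims: 38.1 GiB
# 3072 dims: 114.4 GiB
```

On 10M chunks, the 3072-dimension option no longer fits on a single commodity machine's RAM, which is often what drives the decision.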

Models to consider (end of 2024)

A few options we use in production, by use case:

text-embedding-3-large / text-embedding-3-small (OpenAI): solid, multilingual, managed API. Not open-source, cost per token.
intfloat/e5-mistral-7b-instruct: open-source, 4096 dim, very good on complex tasks. Heavy to self-host.
sentence-transformers/all-MiniLM-L6-v2: lightweight (384 dim), fast, sufficient for many simple English use cases.
multilingual-e5-large: 1024 dim, truly multilingual (including Japanese), good performance/cost compromise.

We favor models with a context window >= 512 tokens to be able to embed substantial chunks without truncating them.

Self-host vs managed API

OpenAI embeddings: simple, no infrastructure to manage, predictable pricing. But your data leaves your infrastructure, and you're exposed to rate limits and downtime.

Self-host (HuggingFace + vLLM / TEI): full control, fixed costs once infrastructure is paid for, no data leak. But you manage scaling, monitoring, versions. At high volume (>100k queries/day), self-hosting becomes cost-effective.

Chunking: splitting without breaking meaning

Chunking is the Achilles' heel of 90% of the RAGs we audit. An embedding only encodes what's in the chunk. If you cut a sentence in two or a concept spans 3 disjoint chunks, retrieval fails.

Chunking strategies

Fixed-size + overlap: 512-token chunks with 50-token overlap. Simple, generic, often sufficient for technical documentation.
Semantic chunking: detect topic transitions (consecutive embeddings that diverge) to split. Cleaner, but slower to index.
Structure-aware: respect markdown boundaries (headers, lists), PDF sections, paragraphs. Preserves hierarchical context.
Hybrid: fixed-size with fallback to sentence boundaries. Speed/quality compromise.

In practice, we start with fixed-size + overlap, measure recall, and iterate if necessary. On structured corpora (internal wikis, technical docs), structure-aware brings a net gain.

Overlap and context

Overlap ensures that a sentence cut at a chunk boundary still appears whole in at least one chunk. 10-20% of chunk size is a good starting point. Too much overlap inflates the chunk count, and with it storage and latency costs.
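A minimal sketch of fixed-size chunking with overlap is shown below. It works on any list of tokens; the example uses placeholder strings, which is an assumption for readability. In a real pipeline you would count tokens with your embedding model's own tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Tiny example: 10 "tokens", windows of 4, 1 token shared between neighbors.
words = [f"w{i}" for i in range(10)]
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window starts `chunk_size - overlap` tokens after the previous one, so the last `overlap` tokens of a chunk are repeated at the start of the next.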

We also often add parent context to the chunk (section title, document name) so the embedding captures "this chunk is about billing, not support." This can be a simple prefix: [Billing] <chunk content>.
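The prefix idea can be sketched in a few lines. The function name and the optional `doc_name` argument are illustrative assumptions; only the `[Billing] <chunk content>` shape comes from the text above.

```python
def with_parent_context(chunk_text, section_title, doc_name=None):
    """Prepend bracketed parent labels so the embedding sees where the chunk lives."""
    labels = [label for label in (doc_name, section_title) if label]
    prefix = "".join(f"[{label}] " for label in labels)
    return prefix + chunk_text

print(with_parent_context("Invoices are issued on the 1st of each month.", "Billing"))
# [Billing] Invoices are issued on the 1st of each month.
```

The prefixed string is what gets embedded; the original chunk text is typically what gets passed to the LLM.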

Vector databases: storage and search

Once embeddings are computed, they need to be stored and queryable by cosine similarity in <50ms. That's the vector DB's job.

Which vector DB?

Options are multiplying; here are the ones we use depending on context:

pgvector (Postgres): Postgres extension, perfect if you already have Postgres, relational metadata + vectors in the same database. Scales to ~1M vectors without issues.
Qdrant: open-source, fast, easy self-host (Docker), rich filters, good monitoring. Our default choice for self-hosting.
Pinecone / Weaviate cloud: managed, auto-scaling, good DX. Cost per vector stored + queries, can get expensive quickly.
Milvus: very scalable (clusters), overkill for <10M vectors, useful if targeting hundreds of millions.

For 80% of enterprise RAGs (<5M chunks), pgvector or Qdrant are more than sufficient.

HNSW index and filters

The dominant search algorithm is HNSW (Hierarchical Navigable Small World). It builds a multi-layer graph that supports approximate nearest-neighbor search in roughly O(log n) time. The precision/speed trade-off is configurable via ef_construction (at index build time) and ef_search (at query time).

In production, we often combine vector similarity + metadata filters: "retrieval in Finance department docs, FR language, less than a year old." The vector DB must support these filters without killing performance (pre-filtering or post-filtering depending on the DB).
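To illustrate the pre-filtering variant, here is a toy in-memory sketch: metadata conditions narrow the candidate set first, then similarity ranks what is left. The record layout (`chunk_id`, `vector`, `meta`) and the `filtered_search` helper are hypothetical; each vector DB exposes its own filter syntax.

```python
import math
from datetime import date

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def filtered_search(query_vec, records, filters, newer_than=None, k=5):
    """Pre-filter on metadata, then rank the survivors by cosine similarity."""
    candidates = [
        r for r in records
        if all(r["meta"].get(key) == value for key, value in filters.items())
        and (newer_than is None or r["meta"]["updated"] >= newer_than)
    ]
    candidates.sort(key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return [r["chunk_id"] for r in candidates[:k]]

records = [
    {"chunk_id": "fin-fr-1", "vector": [0.9, 0.1],
     "meta": {"department": "Finance", "language": "FR", "updated": date(2026, 3, 1)}},
    {"chunk_id": "fin-en-1", "vector": [0.95, 0.05],
     "meta": {"department": "Finance", "language": "EN", "updated": date(2026, 3, 1)}},
    {"chunk_id": "fin-fr-old", "vector": [0.9, 0.1],
     "meta": {"department": "Finance", "language": "FR", "updated": date(2023, 1, 1)}},
    {"chunk_id": "hr-fr-1", "vector": [0.1, 0.9],
     "meta": {"department": "HR", "language": "FR", "updated": date(2026, 2, 1)}},
]
hits = filtered_search(
    [1.0, 0.0], records,
    filters={"department": "Finance", "language": "FR"},
    newer_than=date(2025, 6, 22),
)
print(hits)  # ['fin-fr-1']
```

Post-filtering would run the similarity search first and discard non-matching results afterward, which can return fewer than k hits when the filter is selective.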

Multilingual: FR, EN, JA… and the pitfalls of Japanese

A multilingual corpus breaks many implicit assumptions of English-centric RAG pipelines.

Multilingual embedding model

You need a truly multilingual model, trained on your target languages. English-centric models often project French into a degraded subspace, and Japanese… even worse.

We always validate cross-lingual retrieval: a question in French should be able to retrieve an English document if it's the right context. Some models poorly align vector spaces across languages.

Japanese breaks everything

Japanese is written without spaces between words, byte-level tokenizers (GPT-type) over-segment it, and sentence-based chunking needs a dedicated segmenter: sentences end with 。 rather than an ASCII period, so English-oriented splitters miss the boundaries.

Tokenizer: use a model with an appropriate tokenizer (multilingual SentencePiece, or integrated Japanese tokenizer).
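Chunking: we use a morphological analyzer such as fugashi or sudachipy to find word boundaries, and split sentences on Japanese punctuation (。, ！, ？) so that chunks never end mid-word.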
Overlap: more critical in Japanese, as cutting in the middle of a compound word destroys meaning.
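As a rough illustration of sentence-aware chunking for Japanese, the sketch below uses only the standard library. It splits on sentence-final punctuation and packs whole sentences into chunks under a character budget. It does not call fugashi or sudachipy, and the function names and the `max_chars` budget are assumptions. One known limitation: a closing quote after the punctuation (as in 。」) ends up at the start of the next sentence.

```python
import re

# A run of non-terminal characters followed by any sentence-final punctuation
# (full-width Japanese marks plus their ASCII equivalents).
_SENTENCE = re.compile(r"[^。！？!?]+[。！？!?]*")

def split_japanese_sentences(text):
    """Return the sentences of `text`, each keeping its final punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

def chunk_sentences(sentences, max_chars=200):
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

text = "契約を解約したいです。手続きを教えてください！よろしくお願いします"
sentences = split_japanese_sentences(text)
print(sentences)
print(chunk_sentences(sentences, max_chars=25))
```

Because chunks are built from whole sentences, a boundary can never fall inside a compound word, which is the failure mode that hurts most in Japanese.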

On a recent project with an FR/EN/JA corpus, we had to switch from text-embedding-ada-002 (mediocre JA performance) to multilingual-e5-large plus custom chunking per language. Recall@5 on JA questions went from 45% to 78%.

Evaluating retrieval quality

An embedding is never good or bad in absolute terms. We measure the quality of the retrieval it enables. Without an eval, it is impossible to compare two models or two chunking strategies, or to detect a regression after a change.

Retrieval metrics

Recall@k: for what proportion of questions does an expected chunk appear among the top k results? Recall@5 is a common default.
MRR (Mean Reciprocal Rank): the average, over all questions, of 1/rank of the first correct chunk. A score of 1.0 means the correct chunk always comes first; higher is better.
NDCG: takes result order into account, useful if you have graded relevances (relevant / very relevant).

We prefer Recall@5 + MRR: simple to understand for POs, discriminating enough to decide between technical options.
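MRR is short enough to write out. This sketch assumes each question comes with the ranked list of retrieved chunk IDs and a set of acceptable chunk IDs; the data below is made up.

```python
def reciprocal_rank(retrieved_ids, expected_ids):
    """1/rank of the first retrieved chunk that is expected, or 0.0 if none is."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in expected_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """`runs` is a list of (retrieved_ids, expected_ids) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, e) for r, e in runs) / len(runs)

runs = [
    (["c1", "c7", "c3"], {"c1"}),  # correct chunk at rank 1 -> 1.0
    (["c9", "c2", "c4"], {"c4"}),  # correct chunk at rank 3 -> 1/3
    (["c5", "c6", "c8"], {"c0"}),  # never retrieved -> 0.0
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.444
```

Read together with recall@5, it tells you not only whether the right chunk is found but how high it lands.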

Building the eval set

The eval set is 50-200 pairs (question, expected reference chunks). We build it with domain experts, starting from real user questions (support logs, FAQ, user research).

Minimal format: a CSV with the columns question, doc_id, chunk_id and language. We run each question through retrieval and check whether the expected chunks are in the top-k.
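Loading such a file might look like the sketch below. It assumes one row per (question, expected chunk) pair, so a question with several reference chunks simply appears on several rows; the sample rows and the output field names are invented for the example.

```python
import csv
import io
from collections import defaultdict

def load_eval_set(csv_file):
    """Group rows by question so one question can have several expected chunks."""
    grouped = defaultdict(lambda: {"language": None, "expected_chunk_ids": set()})
    for row in csv.DictReader(csv_file):
        entry = grouped[row["question"]]
        entry["language"] = row["language"]
        entry["expected_chunk_ids"].add((row["doc_id"], row["chunk_id"]))
    return [{"text": question, **entry} for question, entry in grouped.items()]

# In-memory stand-in for a real file opened with open(path, newline="", encoding="utf-8").
SAMPLE_CSV = (
    "question,doc_id,chunk_id,language\n"
    "How do I cancel my contract?,doc-12,3,EN\n"
    "How do I cancel my contract?,doc-12,4,EN\n"
    "Comment résilier mon abonnement ?,doc-12,3,FR\n"
)
eval_set = load_eval_set(io.StringIO(SAMPLE_CSV))
for item in eval_set:
    print(item["language"], item["text"], sorted(item["expected_chunk_ids"]))
```

Keeping the language column makes it easy to report recall per language, which is where multilingual regressions tend to hide.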

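```python
# Simplified recall@k eval example
def evaluate_recall(questions, vector_db, k=5):
    """Share of questions whose expected chunk appears in the top-k results."""
    if not questions:
        return 0.0  # avoid dividing by zero on an empty eval set
    hits = 0
    for q in questions:
        # vector_db.search is assumed to return dicts with a 'chunk_id' key
        results = vector_db.search(q['text'], top_k=k)
        retrieved_ids = [r['chunk_id'] for r in results]
        if q['expected_chunk_id'] in retrieved_ids:
            hits += 1
    return hits / len(questions)
```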

We run this eval in CI on every embedding model change, chunking strategy change, or corpus update. A regression >5 recall points = alert.
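The alert rule can be reduced to a small gate that a CI job calls after the eval. This is a sketch under assumptions: recall values are fractions between 0 and 1, the baseline is stored somewhere by the previous run, and the function names are ours.

```python
def recall_regressed(baseline_recall, current_recall, max_drop_points=5.0):
    """True when recall dropped by more than `max_drop_points` percentage points."""
    drop_points = (baseline_recall - current_recall) * 100
    return drop_points > max_drop_points

def ci_exit_code(baseline_recall, current_recall):
    """Return 1 (fail the job) on a regression, 0 otherwise."""
    if recall_regressed(baseline_recall, current_recall):
        print(f"ALERT: recall@5 fell from {baseline_recall:.2f} to {current_recall:.2f}")
        return 1
    print("recall@5 within tolerance")
    return 0

print(ci_exit_code(0.78, 0.70))
```

In a real pipeline the return value would be passed to `sys.exit` so that the build fails visibly.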

Common pitfalls and field feedback

Embedding metadata… or not?

Some add title, author, date into the embedded text. Advantage: the embedding captures "this chunk is about HR." Disadvantage: pollution if metadata is repetitive or non-semantic (technical IDs).

We prefer to keep non-semantic metadata (author, date, technical IDs) out of the embedded text, store it separately, and rely on vector DB filters. The embedding stays purely semantic, and we filter by department, language or date at query time.

Reranking: the second pass

Vector retrieval quickly returns ~20-50 candidates. We can then apply a reranker (cross-encoder) that finely scores each (question, chunk) pair and keeps the top-5.

The reranker is slower but more precise. It improves MRR by 10-20% on complex questions. Models: cross-encoder/ms-marco-MiniLM, jina-reranker-v1, or via Cohere Rerank API.
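The two-pass shape is independent of the scoring model, so it can be sketched with a pluggable `score_fn`. The word-overlap scorer below is a toy stand-in, not a cross-encoder; in practice `score_fn` would wrap one of the models named above. All names here are hypothetical.

```python
def rerank(question, candidates, score_fn, top_n=5):
    """Score every (question, chunk) pair and keep the `top_n` best chunks."""
    scored = [(score_fn(question, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort on the score only
    return [c for _, c in scored[:top_n]]

def token_overlap_score(question, passage):
    """Toy scorer: fraction of question words that also occur in the passage."""
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

# Pretend these came back from the first-pass vector search.
candidates = [
    {"chunk_id": "a", "text": "Office opening hours"},
    {"chunk_id": "b", "text": "How to cancel a contract"},
    {"chunk_id": "c", "text": "Update contract details"},
]
best = rerank("cancel my contract", candidates, token_overlap_score, top_n=2)
print([c["chunk_id"] for c in best])  # ['b', 'c']
```

The first pass keeps latency low by working on precomputed vectors. The second pass can afford a heavier model because it only sees a few dozen candidates.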

Corpus drift and embedding refresh

The corpus evolves (new docs, updates). Modified chunks need re-embedding. We track doc_id + content hash, and re-index in delta rather than full reindex.
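Delta re-indexing boils down to comparing content hashes. A minimal sketch, assuming you keep a `{chunk_key: hash}` map from the previous indexing run (the key format and function names are illustrative):

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_delta(stored_hashes, current_chunks):
    """Decide which chunks to (re-)embed and which vectors to delete.

    stored_hashes:  {chunk_key: hash} saved at the last indexing run
    current_chunks: {chunk_key: text} produced by chunking the corpus today
    """
    to_embed = [
        key for key, text in current_chunks.items()
        if stored_hashes.get(key) != content_hash(text)
    ]
    to_delete = [key for key in stored_hashes if key not in current_chunks]
    return to_embed, to_delete

stored = {
    "doc-1#0": content_hash("Refunds take 5 days."),
    "doc-1#1": content_hash("Contact support by email."),
    "doc-2#0": content_hash("Old onboarding guide."),
}
current = {
    "doc-1#0": "Refunds take 5 days.",               # unchanged
    "doc-1#1": "Contact support by email or chat.",  # modified
    "doc-3#0": "New security policy.",               # new document
}
print(plan_delta(stored, current))
# (['doc-1#1', 'doc-3#0'], ['doc-2#0'])
```

Only the modified and new chunks go through the embedding model, and vectors for chunks that no longer exist are removed from the index.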

Warning: changing embedding model = re-indexing the entire corpus. Vectors from two models don't live in the same space; you can't mix them.

Costs and sizing

Some orders of magnitude on a corpus of 1 million chunks (512 tokens/chunk, 1024-dim embedding):

One-shot indexing: ~4-8h on V100 GPU (self-host) or ~$50-100 in API calls (OpenAI embeddings).
Vector DB storage: ~4-6 GB in RAM (HNSW index), ~2-3 GB on disk (Qdrant).
Retrieval latency: <30ms @ p95 for top-20, on standard infrastructure (16 vCPU, 32 GB RAM).

Scaling to 10M chunks: we move to a Qdrant or Milvus cluster, with sharding. Latency stays <50ms if the index fits in distributed RAM.

In reality, a RAG is 80% retrieval

The final LLM only reformulates what retrieval gives it. If returned chunks are off-topic, even GPT-4 will hallucinate. Solid retrieval (adapted embeddings, clean chunking, continuous eval) is the difference between a POC that impresses in demos and a system that holds up in production under load and edge cases.

At Abbeal, we build RAGs for demanding environments: multilingual, large corpora, tight latency, data compliance. If you're scaling your RAG and hitting walls on retrieval, multilingual chunking pitfalls, or continuous evaluation, [let's talk](https://abbeal.com/contact). We've already made the mistakes for you.
