AI agents in production: avoiding the demo theatre.

Reliability, cost, security, evaluation. Seven patterns we actually use with our clients.


You shipped an AI agent in demo. The board applauded. Three months later it hallucinates in production, your OpenAI bill tripled, and nobody knows why NPS is dropping. Welcome to PoC theatre, the purgatory where 80% of GenAI projects get stuck.

An agent that holds up in prod is not a clever prompt and a Python wrapper. It's a distributed system with its own SLOs, quality pipelines, and cost governance. Here are the seven patterns we systematically deploy at Abbeal to move from notebook to revenue.

1. Eval first: no feature without a dataset

No prompt change ships to prod without an eval suite. Build a golden dataset of 200 to 500 labeled examples, then automate measurement: exact-match accuracy, faithfulness, latency, and cost per request. Any PR that degrades one of these axes is blocked in CI. Without this discipline, you're flying blind.
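A minimal sketch of such a CI gate, assuming a JSONL-style golden dataset and a `judge` callable that scores each answer; the metric names and thresholds here are illustrative, not recommendations:

```python
# Minimal CI eval gate: score the agent on the golden dataset and fail
# the build if any tracked metric drops below its floor.
# MIN_SCORES values are illustrative thresholds, not recommendations.
MIN_SCORES = {"exactness": 0.90, "faithfulness": 0.95}

def run_eval(agent, golden_dataset, judge):
    """agent: callable(input) -> answer.
    judge: callable(answer, example) -> dict of per-metric scores in [0, 1]."""
    totals = {metric: 0.0 for metric in MIN_SCORES}
    for example in golden_dataset:
        answer = agent(example["input"])
        scores = judge(answer, example)
        for metric in totals:
            totals[metric] += scores[metric]
    return {metric: total / len(golden_dataset) for metric, total in totals.items()}

def gate(metrics):
    """Return the list of failing axes; CI blocks the PR if it is non-empty."""
    return [m for m, floor in MIN_SCORES.items() if metrics[m] < floor]
```

In CI, a non-empty return from `gate` becomes a non-zero exit code, so the merge is mechanically blocked rather than left to reviewer vigilance.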

2. Cost tracking per request, per tenant, per feature

A multi-tool agent can fire 40 LLM calls for a single user query. Without fine-grained telemetry, your margin melts. Log the cost of each span in Datadog or Honeycomb, attributed by customer_id and by feature. That's the prerequisite for billing or rate-limiting per customer.

```python
from langgraph.graph import StateGraph
from anthropic import Anthropic
import datadog

client = Anthropic()

def call_llm(state):
    resp = client.messages.create(...)
    # Per-token pricing: $3 / M input tokens, $15 / M output tokens.
    cost = resp.usage.input_tokens * 3e-6 + resp.usage.output_tokens * 15e-6
    datadog.statsd.histogram(
        'llm.cost.usd', cost,
        tags=[f'tenant:{state.tenant}', f'feature:{state.feature}'],
    )
    return {'response': resp.content}
```

3. Fallback chains: no model is always available

Anthropic goes down. OpenAI rate-limits. You need a plan B wired in code, not in a Notion doc. Define a cascade (Claude Sonnet 4 → GPT-4o → self-hosted Llama 3) with a circuit breaker and a retry budget. Resilience isn't optional for a critical product.
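One way to sketch that cascade: each provider gets its own circuit breaker, and a global retry budget caps how many failed attempts a single request may burn. The thresholds and provider names below are illustrative:

```python
import time

class CircuitBreaker:
    """Open the circuit after max_failures consecutive errors; skip the
    provider until cooldown_s has elapsed, then allow one probe call."""
    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def call_with_fallback(providers, breakers, prompt, retry_budget=2):
    """Try each (name, callable) provider in order, honoring breakers
    and a global retry budget across the whole cascade."""
    attempts = 0
    for name, call in providers:
        if not breakers[name].available():
            continue  # circuit open: skip straight to the next model
        try:
            result = call(prompt)
            breakers[name].record(ok=True)
            return name, result
        except Exception:
            breakers[name].record(ok=False)
            attempts += 1
            if attempts > retry_budget:
                break
    raise RuntimeError("all providers exhausted")
```

The key property: a dead provider stops being hammered after a few failures, and the request degrades to the next model instead of timing out.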

4. Guardrails on input and output

Filter incoming prompts: injections, PII, out-of-scope content. Filter outputs: detectable hallucinations, forbidden data, off-brand tone. A Llama Guard guardrail or a homegrown classifier adds about 50 ms of latency. Cheaper than a GDPR incident.
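A toy sketch of the shape such guardrails take. The regex lists here are deliberately naive stand-ins for a real classifier (Llama Guard or a fine-tuned model); the citation-check heuristic assumes answers cite sources as `[doc_id]`:

```python
import re

# Naive pattern lists standing in for a real safety classifier.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"reveal your system prompt"]
PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # e.g. US SSN shape

def check_input(prompt):
    """Return the list of violations found in an incoming prompt."""
    violations = []
    lowered = prompt.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        violations.append("prompt_injection")
    if any(re.search(p, prompt) for p in PII_PATTERNS):
        violations.append("pii")
    return violations

def check_output(answer, allowed_sources):
    """Flag citations outside the retrieval set: a cheap, detectable
    hallucination signal for RAG-style answers."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    return sorted(cited - set(allowed_sources))
```

In production both checks would sit in middleware around the agent, with blocked requests logged and surfaced in the same observability stack as everything else.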

5. Model registry and versioning

Your prompts are code. They live in Git, versioned, peer-reviewed, deployed via pipeline. A model registry (MLflow, Weights & Biases, or homegrown) tracks: prompt version, model version, eval metrics, rollback rate. Without it, a junior can break prod by editing a YAML file.
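A minimal in-memory sketch of the "homegrown" option, to show what a registry entry needs to pin. The class names, the eval gate threshold, and the rollback mechanics are all illustrative:

```python
from dataclasses import dataclass, field
import hashlib
import time

@dataclass(frozen=True)
class PromptRelease:
    """One immutable registry entry: exactly what is running in prod."""
    prompt_text: str
    model: str          # e.g. "claude-sonnet-4" (illustrative)
    eval_metrics: dict  # scores from the CI eval suite
    git_sha: str        # commit that produced this prompt
    created_at: float = field(default_factory=time.time)

    @property
    def prompt_hash(self):
        return hashlib.sha256(self.prompt_text.encode()).hexdigest()[:12]

class Registry:
    """In-memory stand-in for MLflow / W&B: append-only, with rollback."""
    def __init__(self):
        self.releases = []

    def promote(self, release, min_exactness=0.90):
        # Promotion is gated on evals, so "a junior edits a YAML file"
        # cannot put an unmeasured prompt into prod.
        if release.eval_metrics.get("exactness", 0.0) < min_exactness:
            raise ValueError("eval gate failed, release rejected")
        self.releases.append(release)
        return release.prompt_hash

    def rollback(self):
        """Drop the latest release and return the previous one."""
        self.releases.pop()
        return self.releases[-1]
```

The point is less the storage backend than the invariant: every prod deployment maps to one immutable record tying prompt, model, commit, and eval scores together.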

6. Native observability, not duct tape

Trace every agent run with OpenTelemetry. Spans for each tool call, each LLM call, each retry. LangSmith, Langfuse, or Datadog LLM Observability do the job. You want to answer "why did this user get a terrible response yesterday at 2:32 PM" in 30 seconds, not three hours of grep.
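To keep this sketch dependency-free, here is a toy tracer standing in for OpenTelemetry; it shows the span structure you want (one span per tool call, LLM call, and retry, all under a shared trace id), not the real SDK:

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    """Toy tracer standing in for OpenTelemetry: one trace per agent run,
    one span per tool call / LLM call / retry."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        start = time.monotonic()
        record = {"name": name, "attributes": attributes}
        try:
            yield record
        finally:
            # Spans are appended on close, so timing covers the full body.
            record["duration_s"] = time.monotonic() - start
            self.spans.append(record)

def run_agent(trace, query):
    """Hypothetical agent run: nested spans mirror the call tree."""
    with trace.span("agent.run", query=query):
        with trace.span("tool.search", tool="web_search"):
            docs = ["doc1"]  # pretend tool call
        with trace.span("llm.call", model="claude-sonnet-4", n_docs=len(docs)):
            return "answer"  # pretend LLM call
```

With real OpenTelemetry the structure is the same (`tracer.start_as_current_span` instead of `trace.span`), and the "2:32 PM, terrible response" question becomes a single trace-id lookup.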

7. Feedback loop: prod feeds the eval

Capture thumbs up, manual corrections, session abandonment. Re-inject those signals into your eval dataset every week. An AI agent in prod is a living organism: without a feedback loop, it drifts.
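A sketch of the weekly harvesting step, assuming feedback events are dicts with optional `correction`, `thumbs`, and `abandoned` fields (the event schema is hypothetical):

```python
def harvest_feedback(events):
    """Turn a week of prod feedback events into eval-dataset candidates.
    A manual correction yields a trusted labeled example; a thumbs-down
    or an abandoned session flags the input for human labeling."""
    new_examples, to_label = [], []
    for event in events:
        if event.get("correction"):
            # The human-provided correction becomes the expected answer.
            new_examples.append({
                "input": event["input"],
                "expected": event["correction"],
                "source": "prod_correction",
            })
        elif event.get("thumbs") == "down" or event.get("abandoned"):
            to_label.append(event["input"])
    return new_examples, to_label
```

Run weekly, merge `new_examples` into the golden dataset from pattern 1, and route `to_label` to a human queue: that closes the loop between prod behavior and the eval gate.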

"An AI agent without observability is a junior dev with root access and no logs. You discover the damage by reading the press."
Lead Engineer, Abbeal

These seven patterns are non-negotiable. They turn a seductive PoC into a system that holds 99.9% availability and scales without incinerating your runway. If you want to leave PoC theatre and industrialize your agents, let's talk.

Working on something similar?

Talk to an architect