7 patterns for AI agents in production (no demo theater).
Real-world patterns from RAG, agents and MLOps deployments. Senior teams shipping AI from PoC to prod across Paris, Montréal, Tokyo.
You demoed an AI agent. The board applauded. Three months later it hallucinates in production, your OpenAI bill has tripled, and nobody knows why NPS is dropping. Welcome to PoC theater, the purgatory where 80% of GenAI projects get stuck.
An agent that holds in prod is not a clever prompt and a Python wrapper. It's a distributed system with its own SLOs, its quality pipelines, its cost governance. Here are the seven patterns we systematically deploy at Abbeal to move from notebook to revenue.
1. Eval first: no feature without a dataset
Not a single line of prompt ships to prod without an eval suite. Build a golden dataset of 200 to 500 labeled examples, then automate measurement on four axes: accuracy, faithfulness, latency, cost per request. Any PR that degrades an axis is blocked in CI. Without this discipline, you're flying blind. Concrete 2026 case: on a FinTech RAG project, we moved accuracy from 64% to 91% over 6 sprints without switching LLM, just by industrializing evals and tuning retrieval.
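Here is a minimal sketch of what such a CI gate could look like. The `EvalMetrics` fields, the `regressions` helper and the 1% tolerance are illustrative assumptions, not the exact suite we run; the point is only that each axis gets compared against a stored baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalMetrics:
    accuracy: float              # share of golden examples answered correctly
    faithfulness: float          # share of answers grounded in retrieved context
    p95_latency_ms: float
    cost_per_request_usd: float


# Axes where a higher value is better; for the others, lower is better.
HIGHER_IS_BETTER = {"accuracy", "faithfulness"}
AXES = ("accuracy", "faithfulness", "p95_latency_ms", "cost_per_request_usd")


def regressions(baseline, candidate, tolerance=0.01):
    """Return the axes where candidate is worse than baseline by more than
    the relative tolerance (hypothetical default: 1%)."""
    failed = []
    for axis in AXES:
        before = getattr(baseline, axis)
        after = getattr(candidate, axis)
        if axis in HIGHER_IS_BETTER:
            degraded = after < before * (1 - tolerance)
        else:
            degraded = after > before * (1 + tolerance)
        if degraded:
            failed.append(axis)
    return failed
```

In a pipeline, the job would exit non-zero whenever `regressions(...)` returns a non-empty list, which is what blocks the PR.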
2. Cost tracking per request, per tenant, per feature
A multi-tool agent can fire 40 LLM calls for a single user query. Without fine-grained telemetry, your margin melts. Log the cost of each span in Datadog or Honeycomb and attribute it by customer_id and by feature. That's the prerequisite for billing or rate-limiting. LiteLLM has become our routing layer of choice: we use it to switch dynamically between Haiku ↔ Sonnet ↔ Opus by request complexity. On a Customer Support project, we cut LLM cost by 3.4× in 6 weeks without quality degradation, purely through cost routing.
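```python
from anthropic import Anthropic
from datadog import statsd
from langgraph.graph import StateGraph  # call_llm is meant to be a graph node

client = Anthropic()

# Per-token prices in USD; adjust them to the model you actually call.
INPUT_USD_PER_TOKEN = 3e-6
OUTPUT_USD_PER_TOKEN = 15e-6


def call_llm(state):
    # Assumes `state` exposes `tenant` and `feature` attributes.
    resp = client.messages.create(...)  # request arguments omitted here
    cost = (resp.usage.input_tokens * INPUT_USD_PER_TOKEN
            + resp.usage.output_tokens * OUTPUT_USD_PER_TOKEN)
    # Tag every cost sample so spend can be sliced per tenant and per feature.
    statsd.histogram(
        "llm.cost.usd",
        cost,
        tags=[f"tenant:{state.tenant}", f"feature:{state.feature}"],
    )
    return {"response": resp.content}
```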
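The routing by request complexity mentioned in this pattern can be approximated with a very small heuristic. This is a sketch under stated assumptions: the tier names, the word-count and tool-count thresholds, and the `pick_tier` function are all hypothetical, and a real setup would map each tier to a concrete model in the routing layer's configuration.

```python
# Hypothetical tier names; map them to real model identifiers in your router config.
TIERS = ("small", "medium", "large")


def pick_tier(prompt: str, tool_count: int = 0) -> str:
    """Crude complexity heuristic: long prompts and many tools get a bigger model.

    The thresholds (200 words, 3 tools) are illustrative, not tuned values.
    """
    score = 0
    if len(prompt.split()) > 200:
        score += 1
    if tool_count > 3:
        score += 1
    return TIERS[score]
```

Whatever heuristic you choose, validate it against the eval suite before trusting it: cheaper routing only counts as a win if quality holds.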
3. Fallback chains: no model is always available
Anthropic goes down. OpenAI rate-limits. You need a plan B wired in code, not in a Notion doc. Define a cascade, Claude Sonnet 4 → GPT-4o → self-hosted Llama 3, protected by a circuit breaker and a retry budget. Resilience isn't optional for a critical product.
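A minimal sketch of such a cascade, assuming each provider is wrapped in a plain callable that raises on failure. `CircuitBreaker`, `call_with_fallback` and their default values are illustrative names and numbers, not a specific library's API; production code would also add backoff with jitter between attempts.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive failures; lets calls through
    again once cooldown_s seconds have passed."""

    def __init__(self, max_failures=3, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()


def call_with_fallback(providers, prompt, breakers,
                       attempts_per_provider=2, retry_budget=4):
    """Try (name, callable) providers in order.

    Every attempt, on any provider, consumes one unit of retry_budget, so a
    single user request can never trigger an unbounded number of calls.
    """
    errors = []
    for name, call in providers:
        breaker = breakers[name]
        for _ in range(attempts_per_provider):
            if retry_budget <= 0 or not breaker.allow():
                break
            retry_budget -= 1
            try:
                result = call(prompt)
            except Exception as exc:
                breaker.record(False)
                errors.append(f"{name}: {exc}")
            else:
                breaker.record(True)
                return name, result
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the result makes it easy to tag the span or metric with the model that actually answered.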
4. Guardrails on input and output
Filter incoming prompts: injections, PII, out-of-scope content. Filter outputs: detectable hallucinations, forbidden data, off-brand tone. A Llama Guard guardrail or a homegrown classifier adds roughly 50 ms of latency. Cheaper than a GDPR incident.
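To make the input side concrete, here is a deliberately naive sketch. The regex patterns, the `check_input` function and the reason labels are assumptions for illustration only; a real deployment would rely on a trained classifier such as the ones named above rather than on a handful of regular expressions.

```python
import re

# Illustrative patterns only; they catch the obvious cases and nothing more.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_input(text):
    """Return (allowed, sanitized_text, reasons).

    Injection attempts are rejected; email addresses are redacted but the
    request is still allowed through.
    """
    reasons = []
    if any(p.search(text) for p in INJECTION_PATTERNS):
        reasons.append("prompt_injection")
    sanitized, redactions = EMAIL.subn("[EMAIL]", text)
    if redactions:
        reasons.append("pii_redacted")
    allowed = "prompt_injection" not in reasons
    return allowed, sanitized, reasons
```

The same shape works on the output side: run the model's answer through a second checker before it reaches the user, and log the reasons so they show up in your traces.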
5. Model registry and versioning
Your prompts are code. They live in Git, versioned, peer-reviewed, deployed via pipeline. A model registry (MLflow, Weights & Biases, or homegrown) tracks: prompt version, model version, eval metrics, rollback rate. Without it, a junior can break prod by editing a YAML file.
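One way to picture a registry entry is a record that ties a prompt, a model and its eval metrics to a content-derived version id. This is a sketch, not the schema of MLflow or Weights & Biases: `PromptRelease`, its fields and the 12-character hash are hypothetical choices.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptRelease:
    prompt_name: str
    prompt_text: str
    model: str
    eval_metrics: dict

    @property
    def version(self):
        """Content hash: any edit to the prompt text or the model name
        yields a new version id."""
        payload = json.dumps(
            {"prompt": self.prompt_text, "model": self.model}, sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    def to_record(self):
        """Flatten to a dict suitable for storing in whichever registry you use."""
        record = asdict(self)
        record["version"] = self.version
        return record
```

Because the version is derived from content, a silent edit to a YAML file cannot reuse an old id: the change either goes through review with a new version, or it is visible as a mismatch at deploy time.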
6. Native observability, not duct tape
Trace every agent run with OpenTelemetry. Spans for each tool call, each LLM call, each retry. LangSmith, Langfuse, or Datadog LLM Observability do the job. You want to answer "why did this user get a terrible response yesterday at 2:32 PM" in 30 seconds, not three hours of grep.
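The span structure is easier to see in code. The sketch below uses a tiny in-memory tracer built on the standard library as a stand-in, so it runs without any tracing SDK installed; it is not the OpenTelemetry API. The `span` helper, the `SPANS` list and `run_agent` are hypothetical, and the tool and LLM calls are placeholders.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # in-memory sink; a real setup would export spans to a tracing backend


@contextmanager
def span(name, trace_id, parent=None, **attributes):
    """Record one unit of work with its timing, status and parent span."""
    record = {
        "id": uuid.uuid4().hex,
        "trace_id": trace_id,
        "parent": parent,
        "name": name,
        "attributes": dict(attributes),
        "status": "ok",
    }
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["error"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


def run_agent(question, user_id):
    """One trace per agent run, one child span per tool call and LLM call."""
    trace_id = uuid.uuid4().hex
    with span("agent.run", trace_id, user_id=user_id) as root:
        with span("tool.search", trace_id, parent=root["id"], query=question):
            docs = ["doc-1"]  # placeholder for a real tool call
        with span("llm.call", trace_id, parent=root["id"]) as llm:
            answer = f"answer based on {len(docs)} document(s)"  # placeholder LLM call
            llm["attributes"]["output_chars"] = len(answer)
    return trace_id, answer
```

With the user id on the root span and every child sharing the trace id, the "terrible response yesterday at 2:32 PM" question becomes a filter on two fields rather than a log search.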
7. Feedback loop: prod feeds the eval
Capture thumbs up, manual corrections, session abandonment. Re-inject those signals into your eval dataset every week. An AI agent in prod is a living organism: without a feedback loop, it drifts.
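A minimal sketch of the weekly re-injection step, assuming feedback events arrive as dicts with a `question`, a `rating` and an optional `correction`. The field names, the `merge_feedback` function and the `source` labels are illustrative assumptions.

```python
def merge_feedback(golden, feedback):
    """Fold production feedback into the golden dataset.

    Examples are keyed by normalized question, so repeated feedback on the
    same question updates the existing entry instead of duplicating it.
    """
    merged = {ex["question"].strip().lower(): ex for ex in golden}
    for event in feedback:
        key = event["question"].strip().lower()
        if event.get("correction"):
            # A human correction becomes the new expected answer.
            merged[key] = {
                "question": event["question"],
                "expected": event["correction"],
                "source": "prod_correction",
            }
        elif event.get("rating") == "down" and key not in merged:
            # A thumbs down without a correction is queued for labeling.
            merged[key] = {
                "question": event["question"],
                "expected": None,
                "source": "needs_label",
            }
    return list(merged.values())
```

Entries marked `needs_label` should get a human-written expected answer before they count toward any CI metric; otherwise the eval suite starts grading against guesses.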
"An AI agent without observability is a junior dev with root access and no logs. You discover the damage by reading the press."
These seven patterns are non-negotiable. They turn a seductive PoC into a system that holds a 99.9% SLO and scales without incinerating your runway. If you want to leave PoC theater and industrialize your agents, let's talk.
2026 AI agents stack — what we actually use
Beyond patterns, here's the toolchain we build on with clients as of May 2026. Everything listed is battle-tested on at least 3 production engagements, not PoCs.
- **LLM providers**: Anthropic Claude 3.7 Sonnet (default), OpenAI GPT-4o (fallback), Llama 3.3 70B self-hosted via vLLM or Modal for sensitive workloads (health, banking, JP APPI).
- **Agent orchestration**: LangGraph, for its explicit state machine (nodes and edges) rather than a free-form agent deciding everything alone. Much easier to debug and reproduce.
- **CI evals**: Promptfoo in the GitHub Actions pipeline (dataset-vs-judge tests). Langfuse for continuous production tracking. LangSmith if already on LangChain.
- **Observability**: Langfuse + Helicone for lightweight multi-provider setups, Datadog LLM Observability for enterprise engagements.
- **Vector store**: Pinecone for >10M-vector workloads with strict SLAs. PostgreSQL + pgvector for <1M when Postgres is already in prod (95% of FinTech cases).
- **Routing & cost**: LiteLLM as multi-LLM abstraction layer (provider switching + per-request cost capping).
