AI
AI agents in production: avoiding the demo theatre.
Reliability, cost, security, evaluation. Seven patterns we actually use with our clients.
You shipped an AI agent as a demo. The board applauded. Three months later it hallucinates in production, your OpenAI bill has tripled, and nobody knows why NPS is dropping. Welcome to PoC theatre, the purgatory where 80% of GenAI projects get stuck.
An agent that holds up in prod is not a clever prompt and a Python wrapper. It's a distributed system with its own SLOs, its own quality pipelines, its own cost governance. Here are the seven patterns we systematically deploy at Abbeal to move from notebook to revenue.
1. Eval first: no feature without a dataset
Not a single line of prompt ships to prod without an eval suite. Build a golden dataset of 200 to 500 labeled examples, then automate measurement: exact match, faithfulness, latency, cost per request. Any PR that degrades one of these axes is blocked in CI. Without this discipline, you're flying blind.
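A minimal sketch of such a CI gate, assuming a hypothetical `run_agent` entry point that returns an answer plus its latency and cost:

```python
# Score agent outputs against a golden dataset and fail CI if any
# tracked axis regresses past its threshold. `run_agent` is a
# hypothetical stand-in for your agent's entry point.
from dataclasses import dataclass

@dataclass
class EvalResult:
    exact_match: float    # fraction of answers equal to the label
    avg_latency_ms: float
    avg_cost_usd: float

def evaluate(run_agent, golden_dataset):
    hits, latency, cost = 0, 0.0, 0.0
    for example in golden_dataset:
        out = run_agent(example["input"])
        hits += out["answer"] == example["label"]
        latency += out["latency_ms"]
        cost += out["cost_usd"]
    n = len(golden_dataset)
    return EvalResult(hits / n, latency / n, cost / n)

def ci_gate(result, thresholds):
    # Block the PR if any axis degrades past its threshold.
    return (result.exact_match >= thresholds["exact_match"]
            and result.avg_latency_ms <= thresholds["latency_ms"]
            and result.avg_cost_usd <= thresholds["cost_usd"])
```

Wire `ci_gate` into your pipeline so a red eval fails the build exactly like a red unit test.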
2. Cost tracking per request, per tenant, per feature
A multi-tool agent can fire 40 LLM calls for a single user query. Without fine-grained telemetry, your margin melts. Log the cost of each span in Datadog or Honeycomb. Attribute it by customer_id and by feature. That's the prerequisite for billing or rate-limiting.
```python
from langgraph.graph import StateGraph
from anthropic import Anthropic
import datadog

client = Anthropic()

def call_llm(state):
    # One LangGraph node = one LLM call = one cost data point.
    resp = client.messages.create(...)
    cost = (resp.usage.input_tokens * 3e-6       # $3 / M input tokens
            + resp.usage.output_tokens * 15e-6)  # $15 / M output tokens
    datadog.statsd.histogram(
        'llm.cost.usd', cost,
        tags=[f'tenant:{state.tenant}', f'feature:{state.feature}'],
    )
    return {'response': resp.content}
```
3. Fallback chains: no model is always available
Anthropic goes down. OpenAI rate-limits you. You need a plan B wired in code, not in a Notion doc. Define a cascade: Claude Sonnet 4 → GPT-4o → self-hosted Llama 3, with a circuit breaker and a retry budget. Resilience isn't optional for a critical product.
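A hedged sketch of that cascade, assuming hypothetical `call_*` wrappers around each provider's SDK; the breaker thresholds are illustrative:

```python
# Fallback cascade with a per-request retry budget and a naive
# circuit breaker. Provider callables are hypothetical stand-ins.
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def call_with_fallback(prompt, providers, retry_budget=3):
    # providers: list of (name, callable, CircuitBreaker) in cascade order,
    # e.g. Claude Sonnet 4 -> GPT-4o -> self-hosted Llama 3.
    attempts = 0
    for name, call, breaker in providers:
        if not breaker.available():
            continue
        while attempts < retry_budget:
            attempts += 1
            try:
                resp = call(prompt)
                breaker.record(ok=True)
                return name, resp
            except Exception:
                breaker.record(ok=False)
                if not breaker.available():
                    break  # breaker tripped: move to the next provider
    raise RuntimeError("all providers exhausted")
```

The retry budget is shared across the whole cascade so a flapping primary can't starve the fallbacks.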
4. Guardrails on input and output
Filter incoming prompts: injections, PII, out-of-scope content. Filter outputs: detectable hallucinations, forbidden data, off-brand tone. A Llama Guard guardrail or a homegrown classifier costs about 50 ms of latency. Cheaper than a GDPR incident.
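A minimal sketch of the two checkpoints; the patterns below are illustrative placeholders, not a real PII or injection detector (a dedicated classifier such as Llama Guard covers far more cases):

```python
# Illustrative input/output guardrails: keyword markers for injection,
# a regex for one PII type, a denylist for outputs. Placeholder rules only.
import re

INJECTION_MARKERS = ("ignore previous instructions", "system prompt")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def check_input(prompt):
    violations = []
    lowered = prompt.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        violations.append("prompt_injection")
    if EMAIL_RE.search(prompt):
        violations.append("pii_email")
    return violations

def check_output(text, forbidden_terms):
    lowered = text.lower()
    if any(term.lower() in lowered for term in forbidden_terms):
        return ["forbidden_data"]
    return []
```

Run `check_input` before the agent and `check_output` before the user ever sees the response; log every violation with the same tenant tags as your cost telemetry.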
5. Model registry and versioning
Your prompts are code. They live in Git, versioned, peer-reviewed, deployed via pipeline. A model registry (MLflow, Weights & Biases, or homegrown) tracks: prompt version, model version, eval metrics, rollback rate. Without it, a junior can break prod by editing a YAML file.
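For the homegrown option, a sketch of what one registry record might hold; the field names are illustrative, and hashing the prompt ties each deploy to an exact, reviewable version:

```python
# Hypothetical homegrown registry record: one immutable entry per
# prompt+model release, with its eval metrics attached.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PromptRelease:
    prompt_text: str
    model: str             # e.g. "claude-sonnet-4"
    eval_exact_match: float
    rollback_rate: float

    @property
    def prompt_version(self):
        # Content-addressed version: same prompt text, same version.
        return hashlib.sha256(self.prompt_text.encode()).hexdigest()[:12]

    def to_record(self):
        return json.dumps({**asdict(self), "prompt_version": self.prompt_version})
```

Whether you store these records in MLflow, Weights & Biases, or a plain table, the point is the same: rollback becomes "redeploy version abc123", not archaeology.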
6. Native observability, not duct tape
Trace every agent run with OpenTelemetry. Spans for each tool call, each LLM call, each retry. LangSmith, Langfuse, or Datadog LLM Observability do the job. You want to answer "why did this user get a terrible response yesterday at 2:32 PM" in 30 seconds, not three hours of grep.
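A stdlib stand-in illustrating the span structure you'd emit, assuming one span per tool call and per LLM call with cost attached; in production you'd use OpenTelemetry's tracer rather than this toy recorder:

```python
# Toy span recorder mimicking the shape of OpenTelemetry traces:
# nested spans, attributes per span, duration measured on exit.
import time
from contextlib import contextmanager

class SpanRecorder:
    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        start = time.monotonic()
        record = {"name": name, **attributes}
        try:
            yield record  # callers attach more attributes mid-span
        finally:
            record["duration_ms"] = (time.monotonic() - start) * 1000
            self.spans.append(record)

recorder = SpanRecorder()

def run_agent(user_id, query):
    # One root span per agent run, one child span per tool/LLM call.
    with recorder.span("agent.run", user_id=user_id):
        with recorder.span("tool.search", query=query) as s:
            s["result_count"] = 3           # hypothetical tool result
        with recorder.span("llm.call", model="claude-sonnet-4") as s:
            s["cost_usd"] = 0.0042          # attach cost per span
```

With spans shaped like this, "why did this user get a terrible response at 2:32 PM" is one query on `user_id` plus a timestamp.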
7. Feedback loop: prod feeds the eval
Capture thumbs up, manual corrections, session abandonment. Re-inject those signals into your eval dataset every week. An AI agent in prod is a living organism: without a feedback loop, it drifts.
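A sketch of the weekly re-injection step, assuming hypothetical feedback events with these field names:

```python
# Turn production feedback signals into eval-dataset candidates.
# Event field names ("thumbs", "correction", ...) are illustrative.
def harvest_feedback(events):
    candidates = []
    for e in events:
        if e.get("correction"):            # a human wrote the right answer
            candidates.append({"input": e["input"], "label": e["correction"]})
        elif e.get("thumbs") == "up":      # implicit label: keep model answer
            candidates.append({"input": e["input"], "label": e["output"]})
        # thumbs-down or abandonment without a correction: queue for
        # human labeling rather than guessing a label here
    return candidates
```

Append the candidates to the golden dataset from pattern 1 and the loop closes: prod failures become next week's regression tests.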
"An AI agent without observability is a junior dev with root access and no logs. You discover the damage by reading the press."
These seven patterns are non-negotiable. They turn a seductive PoC into a system that holds its 99.9% SLO and scales without incinerating your runway. If you want to leave PoC theatre and industrialize your agents, let's talk.