AI agents in production: avoiding the demo theatre.

Reliability, cost, security, evaluation. Seven patterns we actually use with our clients.


You shipped an AI agent in demo. The board applauded. Three months later it hallucinates in production, your OpenAI bill tripled, and nobody knows why NPS is dropping. Welcome to PoC theatre, the purgatory where 80% of GenAI projects get stuck.

An agent that holds up in prod is not a clever prompt and a Python wrapper. It's a distributed system with its own SLOs, quality pipelines, and cost governance. Here are the seven patterns we systematically deploy at Abbeal to move from notebook to revenue.

1. Eval first: no feature without a dataset

No prompt change ships to prod without an eval suite. Build a golden dataset of 200 to 500 labeled examples, then automate measurement: exact-match accuracy, faithfulness, latency, and cost per request. Any PR that degrades one of these axes is blocked in CI. Without this discipline, you're flying blind.
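A minimal sketch of such a CI gate, assuming a JSONL-style golden dataset and a `judge` callable that scores each answer; the metric names and thresholds here are illustrative, not recommendations:

```python
# Minimal CI eval gate: score the agent on the golden dataset and fail
# the build if any tracked metric drops below its floor.
# MIN_SCORES values are illustrative thresholds, not recommendations.
MIN_SCORES = {"exactness": 0.90, "faithfulness": 0.95}

def run_eval(agent, golden_dataset, judge):
    """agent: callable(input) -> answer.
    judge: callable(answer, example) -> dict of per-metric scores in [0, 1]."""
    totals = {metric: 0.0 for metric in MIN_SCORES}
    for example in golden_dataset:
        answer = agent(example["input"])
        scores = judge(answer, example)
        for metric in totals:
            totals[metric] += scores[metric]
    return {metric: total / len(golden_dataset) for metric, total in totals.items()}

def gate(metrics):
    """Return the list of failing axes; CI blocks the PR if it is non-empty."""
    return [m for m, floor in MIN_SCORES.items() if metrics[m] < floor]
```

In CI, a non-empty return from `gate` becomes a non-zero exit code, so the merge is mechanically blocked rather than left to reviewer vigilance.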

2. Cost tracking per request, per tenant, per feature

A multi-tool agent can fire 40 LLM calls for a single user query. Without fine-grained telemetry, your margin melts. Log the cost of each span in Datadog or Honeycomb, attributed by customer_id and by feature. That's the prerequisite for billing or rate-limiting per customer.

```python
from langgraph.graph import StateGraph
from anthropic import Anthropic
import datadog

client = Anthropic()

def call_llm(state):
    resp = client.messages.create(...)
    # Per-token pricing: $3 / M input tokens, $15 / M output tokens.
    cost = resp.usage.input_tokens * 3e-6 + resp.usage.output_tokens * 15e-6
    datadog.statsd.histogram(
        'llm.cost.usd', cost,
        tags=[f'tenant:{state.tenant}', f'feature:{state.feature}'],
    )
    return {'response': resp.content}
```

3. Fallback chains: no model is always available

Anthropic goes down. OpenAI rate-limits. You need a plan B wired in code, not in a Notion doc. Define a cascade (Claude Sonnet 4 → GPT-4o → self-hosted Llama 3) with a circuit breaker and a retry budget. Resilience isn't optional for a critical product.
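One way to sketch that cascade: each provider gets its own circuit breaker, and a global retry budget caps how many failed attempts a single request may burn. The thresholds and provider names below are illustrative:

```python
import time

class CircuitBreaker:
    """Open the circuit after max_failures consecutive errors; skip the
    provider until cooldown_s has elapsed, then allow one probe call."""
    def __init__(self, max_failures=3, cooldown_s=30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def available(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: let one probe through
            self.failures = 0
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def call_with_fallback(providers, breakers, prompt, retry_budget=2):
    """Try each (name, callable) provider in order, honoring breakers
    and a global retry budget across the whole cascade."""
    attempts = 0
    for name, call in providers:
        if not breakers[name].available():
            continue  # circuit open: skip straight to the next model
        try:
            result = call(prompt)
            breakers[name].record(ok=True)
            return name, result
        except Exception:
            breakers[name].record(ok=False)
            attempts += 1
            if attempts > retry_budget:
                break
    raise RuntimeError("all providers exhausted")
```

The key property: a dead provider stops being hammered after a few failures, and the request degrades to the next model instead of timing out.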

4. Guardrails on input and output

Filter incoming prompts: injections, PII, out-of-scope content. Filter outputs: detectable hallucinations, forbidden data, off-brand tone. A Llama Guard guardrail or a homegrown classifier adds about 50 ms of latency. Cheaper than a GDPR incident.
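A toy sketch of the shape such guardrails take. The regex lists here are deliberately naive stand-ins for a real classifier (Llama Guard or a fine-tuned model); the citation-check heuristic assumes answers cite sources as `[doc_id]`:

```python
import re

# Naive pattern lists standing in for a real safety classifier.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"reveal your system prompt"]
PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # e.g. US SSN shape

def check_input(prompt):
    """Return the list of violations found in an incoming prompt."""
    violations = []
    lowered = prompt.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        violations.append("prompt_injection")
    if any(re.search(p, prompt) for p in PII_PATTERNS):
        violations.append("pii")
    return violations

def check_output(answer, allowed_sources):
    """Flag citations outside the retrieval set: a cheap, detectable
    hallucination signal for RAG-style answers."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    return sorted(cited - set(allowed_sources))
```

In production both checks would sit in middleware around the agent, with blocked requests logged and surfaced in the same observability stack as everything else.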

5. Model registry and versioning

Your prompts are code. They live in Git, versioned, peer-reviewed, deployed via pipeline. A model registry (MLflow, Weights & Biases, or homegrown) tracks: prompt version, model version, eval metrics, rollback rate. Without it, a junior can break prod by editing a YAML file.
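A minimal in-memory sketch of the "homegrown" option, to show what a registry entry needs to pin. The class names, the eval gate threshold, and the rollback mechanics are all illustrative:

```python
from dataclasses import dataclass, field
import hashlib
import time

@dataclass(frozen=True)
class PromptRelease:
    """One immutable registry entry: exactly what is running in prod."""
    prompt_text: str
    model: str          # e.g. "claude-sonnet-4" (illustrative)
    eval_metrics: dict  # scores from the CI eval suite
    git_sha: str        # commit that produced this prompt
    created_at: float = field(default_factory=time.time)

    @property
    def prompt_hash(self):
        return hashlib.sha256(self.prompt_text.encode()).hexdigest()[:12]

class Registry:
    """In-memory stand-in for MLflow / W&B: append-only, with rollback."""
    def __init__(self):
        self.releases = []

    def promote(self, release, min_exactness=0.90):
        # Promotion is gated on evals, so "a junior edits a YAML file"
        # cannot put an unmeasured prompt into prod.
        if release.eval_metrics.get("exactness", 0.0) < min_exactness:
            raise ValueError("eval gate failed, release rejected")
        self.releases.append(release)
        return release.prompt_hash

    def rollback(self):
        """Drop the latest release and return the previous one."""
        self.releases.pop()
        return self.releases[-1]
```

The point is less the storage backend than the invariant: every prod deployment maps to one immutable record tying prompt, model, commit, and eval scores together.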

6. Native observability, not duct tape

Trace every agent run with OpenTelemetry. Spans for each tool call, each LLM call, each retry. LangSmith, Langfuse, or Datadog LLM Observability do the job. You want to answer "why did this user get a terrible response yesterday at 2:32 PM" in 30 seconds, not three hours of grep.
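To keep this sketch dependency-free, here is a toy tracer standing in for OpenTelemetry; it shows the span structure you want (one span per tool call, LLM call, and retry, all under a shared trace id), not the real SDK:

```python
import time
import uuid
from contextlib import contextmanager

class Trace:
    """Toy tracer standing in for OpenTelemetry: one trace per agent run,
    one span per tool call / LLM call / retry."""
    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, name, **attributes):
        start = time.monotonic()
        record = {"name": name, "attributes": attributes}
        try:
            yield record
        finally:
            # Spans are appended on close, so timing covers the full body.
            record["duration_s"] = time.monotonic() - start
            self.spans.append(record)

def run_agent(trace, query):
    """Hypothetical agent run: nested spans mirror the call tree."""
    with trace.span("agent.run", query=query):
        with trace.span("tool.search", tool="web_search"):
            docs = ["doc1"]  # pretend tool call
        with trace.span("llm.call", model="claude-sonnet-4", n_docs=len(docs)):
            return "answer"  # pretend LLM call
```

With real OpenTelemetry the structure is the same (`tracer.start_as_current_span` instead of `trace.span`), and the "2:32 PM, terrible response" question becomes a single trace-id lookup.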

7. Feedback loop: prod feeds the eval

Capture thumbs up, manual corrections, session abandonment. Re-inject those signals into your eval dataset every week. An AI agent in prod is a living organism: without a feedback loop, it drifts.
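A sketch of the weekly harvesting step, assuming feedback events are dicts with optional `correction`, `thumbs`, and `abandoned` fields (the event schema is hypothetical):

```python
def harvest_feedback(events):
    """Turn a week of prod feedback events into eval-dataset candidates.
    A manual correction yields a trusted labeled example; a thumbs-down
    or an abandoned session flags the input for human labeling."""
    new_examples, to_label = [], []
    for event in events:
        if event.get("correction"):
            # The human-provided correction becomes the expected answer.
            new_examples.append({
                "input": event["input"],
                "expected": event["correction"],
                "source": "prod_correction",
            })
        elif event.get("thumbs") == "down" or event.get("abandoned"):
            to_label.append(event["input"])
    return new_examples, to_label
```

Run weekly, merge `new_examples` into the golden dataset from pattern 1, and route `to_label` to a human queue: that closes the loop between prod behavior and the eval gate.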

"An AI agent without observability is a junior dev with root access and no logs. You discover the damage by reading the press."
Lead Engineer, Abbeal

These seven patterns are non-negotiable. They turn a seductive PoC into a system that holds 99.9% availability and scales without incinerating your runway. If you want to leave PoC theatre and industrialize your agents, let's talk.

Working on something similar?

Talk to an architect