Category · AI
Inference
Running an AI model on demand (as opposed to training).
The dominant production cost for an LLM product: every request burns GPU-seconds. The main optimization levers are prompt caching, request batching, quantization, model routing (e.g. Claude Haiku for simple queries, Claude Sonnet for complex ones), and self-hosting with vLLM.
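A minimal model-routing sketch in Python, assuming the Anthropic SDK is installed and `ANTHROPIC_API_KEY` is set. The `is_complex` heuristic and the specific model IDs are illustrative assumptions, not a prescribed routing policy.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def is_complex(prompt: str) -> bool:
    # Hypothetical heuristic: long prompts or reasoning-heavy keywords
    # get routed to the stronger (more expensive) model.
    keywords = ("analyze", "compare", "plan", "debug")
    return len(prompt) > 2000 or any(k in prompt.lower() for k in keywords)

def route(prompt: str) -> str:
    # Cheap model for simple queries, stronger model for complex ones.
    # Model IDs are examples; check current availability before use.
    model = "claude-sonnet-4-20250514" if is_complex(prompt) else "claude-3-5-haiku-20241022"
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(route("What is the capital of France?"))  # simple query, routed to Haiku
```

In practice the routing heuristic is often a small classifier or a cheap LLM call rather than a keyword check; the cost win comes from keeping the bulk of traffic on the smaller model.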
