Abbeal

Category · AI

Inference

Running a trained AI model to serve requests on demand, as opposed to training it.

Inference is the dominant production cost for an LLM: every request burns GPU-seconds. Common optimization levers: prompt caching, batching, quantization, model routing (Claude Haiku for simple queries, Claude Sonnet for complex ones), and self-hosting with vLLM.
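Model routing can be sketched in a few lines. This is a minimal illustration, not a production classifier: the complexity heuristic (token count plus a keyword check) and the model identifiers are assumptions for the example.

```python
# Minimal sketch of model routing: send cheap queries to a small model
# and complex ones to a larger one. The heuristic below (token count
# and keyword hints) is an illustrative assumption, not a real classifier.

SMALL_MODEL = "claude-haiku"   # hypothetical model identifiers
LARGE_MODEL = "claude-sonnet"

COMPLEX_HINTS = {"analyze", "compare", "refactor", "prove", "summarize"}

def route(prompt: str, max_simple_tokens: int = 50) -> str:
    """Pick a model based on a rough complexity estimate of the prompt."""
    tokens = prompt.lower().split()
    looks_complex = len(tokens) > max_simple_tokens or any(
        t.strip(".,!?") in COMPLEX_HINTS for t in tokens
    )
    return LARGE_MODEL if looks_complex else SMALL_MODEL

print(route("What is the capital of France?"))                       # small model
print(route("Analyze this codebase and refactor the auth layer."))   # large model
```

In production, the same routing decision is usually made by a dedicated lightweight classifier or by the small model itself, with the large model as fallback; the payoff is that the cheap model handles the bulk of traffic.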
