Category · AI
Inference
Running an AI model on demand (as opposed to training).
The dominant production cost for an LLM product: every request burns GPU-seconds. The main optimization levers are prompt caching, request batching, quantization, model routing (e.g. Claude Haiku for simple queries, Claude Sonnet for complex ones), and self-hosting with vLLM.
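A minimal model-routing sketch in Python, assuming the Anthropic SDK is installed and `ANTHROPIC_API_KEY` is set. The `is_complex` heuristic and the specific model IDs are illustrative assumptions, not a prescribed routing policy.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def is_complex(prompt: str) -> bool:
    # Hypothetical heuristic: long prompts or reasoning-heavy keywords
    # get routed to the stronger (more expensive) model.
    keywords = ("analyze", "compare", "plan", "debug")
    return len(prompt) > 2000 or any(k in prompt.lower() for k in keywords)

def route(prompt: str) -> str:
    # Cheap model for simple queries, stronger model for complex ones.
    # Model IDs are examples; check current availability before use.
    model = "claude-sonnet-4-20250514" if is_complex(prompt) else "claude-3-5-haiku-20241022"
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(route("What is the capital of France?"))  # simple query, routed to Haiku
```

In practice the routing heuristic is often a small classifier or a cheap LLM call rather than a keyword check; the cost win comes from keeping the bulk of traffic on the smaller model.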
