Blog
Product updates, engineering deep dives, and practical guides from the Casola team.
engineering infrastructure
One inference platform, four API surfaces
How OpenAI-, Anthropic-, and Fal.ai-compatible clients share the same dispatch backend with Casola's native API, and where they can't
Casola Team
engineering infrastructure
Every inference request comes with a compliance certificate
Verifiable data residency built into every request, without dedicated infrastructure
Casola Team
engineering infrastructure
Building a GPU autoscaler that works: queueing theory and utilization metrics combined
Why utilization alone is the wrong scaling signal for GPU inference, and how arrival rate, Little's Law, and queue drain work better
Casola Team
engineering infrastructure
Where the milliseconds go in a GPU inference request
End-to-end latency decomposition across a multi-modal inference pipeline, and the five decisions that keep overhead off the critical path
Casola Team
engineering infrastructure
GPU workers fail in interesting ways
From PCIe bus failures to cascading cloud outages: what actually breaks in a distributed GPU inference fleet, and how you build around it
Casola Team