A network link goes down. You have failover configured. The backup path comes up. From your monitoring probes, everything looks fine.
For the other half of the internet, you don’t exist. Packets are going into a black hole and nobody’s told you yet.
Just a normal day in infra land.
Casola provides managed inference infrastructure for AI agents and applications. At its core, it routes requests to regional GPU workers, handles scheduling and capacity, and makes sure that when something breaks in the underlying compute layer — and something always eventually breaks — the failure is handled before it becomes your problem.
This is a post about what actually breaks and how we think about it.
The failure taxonomy
Crashes are easy. OOM is the failure people design for first: a large model, a long context, a batch that tips the memory budget. Clean, detectable, recoverable. Or the fire sprinklers flood the datacenter. Obvious, if painful.
Then come the silent failures. GPU driver instability, engine hangs, a GPU falling off the PCIe bus mid-inference. The process is still alive. The connection might still be open. Nothing has explicitly failed. But no output is coming.
The really tricky ones are intermittent. Heisenbugs. InfiniBand connections that become unstable between nodes in a multi-GPU setup, then recover. Network outages that last fractions of a minute, just long enough to break a job, short enough to look like noise. The connection state says one thing; the actual behavior says another.
Every one of these surfaces differently and at a different point in the job lifecycle — before dispatch, during generation, after generation completes but before results are written.
Why connection state isn’t enough
The instinct is to watch the connection. If it drops, the job failed. Re-queue it.
The problem: an open connection is not a signal that work is progressing. A frozen process doesn’t close its sockets. TCP keepalive detection is slow by default (Linux waits two hours of idle time before it sends the first probe). A worker can appear fully connected while producing nothing.
So we don’t rely on connection state for correctness. Every job is timeboxed and observed individually.
Each job gets a lease when it’s dispatched to a worker. The worker renews that lease while actively processing. If renewal stops for any reason — crash, hang, silent failure — the clock runs out, and the job re-queues automatically. The queue doesn’t wait to hear from the connection.
job dispatched → lease set (e.g. 30s)
worker running → lease renewed every N seconds
worker silent → lease expires → job re-queued with new fence token

The fence token is the other half of this. When a job re-queues, it gets a new token. If a worker recovers from a transient failure and tries to write a result for the old token, the write is rejected. This matters more than it sounds.
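The lease flow above can be sketched in a few lines. This is a toy in-memory model, not Casola’s implementation: the class name `LeaseQueue`, its method names, and the use of a simple counter for fence tokens are all assumptions made for illustration. Time is passed in explicitly so the behavior is easy to follow.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    fence_token: int
    lease_deadline: Optional[float] = None  # None means queued, not leased


class LeaseQueue:
    """Hypothetical in-memory queue; a real one would persist this state."""

    def __init__(self, lease_seconds: float = 30.0):
        self.lease_seconds = lease_seconds
        self.jobs: dict[str, Job] = {}
        self._tokens = itertools.count(1)  # monotonically increasing tokens

    def submit(self, job_id: str) -> None:
        self.jobs[job_id] = Job(job_id, fence_token=next(self._tokens))

    def dispatch(self, job_id: str, now: float) -> int:
        """Hand the job to a worker: start the lease, return the current token."""
        job = self.jobs[job_id]
        job.lease_deadline = now + self.lease_seconds
        return job.fence_token

    def renew(self, job_id: str, token: int, now: float) -> bool:
        """Extend the lease, but only for the holder of the current token."""
        job = self.jobs[job_id]
        if job.lease_deadline is None or now > job.lease_deadline:
            return False
        if token != job.fence_token:
            return False
        job.lease_deadline = now + self.lease_seconds
        return True

    def expire_leases(self, now: float) -> list[str]:
        """Re-queue every job whose lease ran out, issuing a new fence token."""
        requeued = []
        for job in self.jobs.values():
            if job.lease_deadline is not None and now > job.lease_deadline:
                job.lease_deadline = None
                job.fence_token = next(self._tokens)  # old holder is now stale
                requeued.append(job.job_id)
        return requeued


queue = LeaseQueue(lease_seconds=30.0)
queue.submit("job-1")
old_token = queue.dispatch("job-1", now=0.0)
queue.renew("job-1", old_token, now=20.0)   # worker is alive, deadline moves to 50
requeued = queue.expire_leases(now=51.0)    # worker went silent, lease ran out
new_token = queue.dispatch("job-1", now=52.0)
```

Note that the expiry check never consults a connection: the only input is whether renewals kept arriving before the deadline.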
When the result shows up anyway
Here’s a version that’s harder to handle than a clean crash: everything looks fine at the queue, everything looks fine at the worker, but the storage layer that holds results takes a nap.
The worker completes inference. Writes the result. The write appears to succeed locally. But upstream, the blob store is lagging or partitioned, and the result hasn’t actually propagated. The job times out from the queue’s perspective, re-queues, and a new worker picks it up.
Then the blob store wakes up and suddenly delivers the original result — along with a hundred others that accumulated while it was degraded.
The fence token handles this too. The re-queued job has a new token. When the stale result arrives, it doesn’t match. The queue ignores it and serves the fresh result instead.
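A minimal sketch of that check on the write path, assuming the queue keeps one current token per job. The `ResultStore` class and its method names are hypothetical, chosen only to illustrate the comparison.

```python
class ResultStore:
    """Hypothetical result sink that only accepts writes for the current token."""

    def __init__(self):
        self.current_token: dict[str, int] = {}
        self.results: dict[str, str] = {}

    def set_token(self, job_id: str, token: int) -> None:
        # Called on first dispatch and again on every re-queue.
        self.current_token[job_id] = token

    def write_result(self, job_id: str, token: int, payload: str) -> bool:
        # A write carrying anything other than the current token is stale.
        if self.current_token.get(job_id) != token:
            return False
        self.results[job_id] = payload
        return True


store = ResultStore()
store.set_token("job-1", 1)   # first dispatch
store.set_token("job-1", 2)   # lease expired, job re-queued with a new token

fresh_ok = store.write_result("job-1", 2, "fresh result")   # accepted
stale_ok = store.write_result("job-1", 1, "late result")    # rejected
```

The stale write is rejected whether it arrives before or after the fresh one, because the comparison is against the token, not against arrival order.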
Network partitions are the worst day
We’ve seen an AWS outage cascade into a GCP degradation that cascaded into broader connectivity issues across multiple providers — none of them fully down, but the routing between them unreliable enough to make each look broken depending on where you were standing.
The key property of a network partition: just because you can reach a worker from your monitoring box doesn’t mean the remaining 99% of the internet can. Your checks pass. Your dashboards are green. Your customers can’t reach you.
This is why infra engineers have a job.
We run across multiple regions specifically for this reason. Queue state is regional. Workers are regional. When a region has a bad day, traffic routes around it. Jobs that were in-flight in the affected region time out on their leases, re-queue in a healthy region, and get dispatched to workers that are actually reachable. Metadata is replicated so regional failures don’t take down the global view.
We also route through Cloudflare to reduce the chance of arbitrary internet routing putting traffic into a black hole. Traffic that enters the network gets handed off over Cloudflare’s backbone rather than traversing unpredictable third-party paths.
Retry storms
The retry mechanism has its own failure mode.
A request that’s fundamentally broken — a prompt the model can’t handle, an input that always causes a hang — gets retried. Each retry occupies a worker. Workers serving broken requests are slower than normal. That slowness starts to affect other requests in the queue. Those requests begin to see higher latency, some time out, and they start retrying too.
This is self-reinforcing in a way that compounds fast. Round one: one bad request occupies a worker, throughput drops a little. Round two: the slowdown causes two healthy requests to time out, so now three requests are retrying. Round three: six more time out, for nine in total. Each cycle degrades the system further, which causes more timeouts, which produces more retries. A single broken input can trigger a feedback loop that amplifies into system-wide degradation within minutes.
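To make the compounding concrete, here is the arithmetic of those rounds taken at face value: every retrying request causes two healthy requests to time out in the next round. That fixed ratio is an extrapolation from the numbers in the text, not a measurement, and the function name is made up for this sketch. A real system is bounded by worker capacity and retry budgets, so growth flattens out eventually.

```python
def retrying_per_round(rounds: int, timeouts_per_retry: int = 2, initial_bad: int = 1) -> list[int]:
    """Return how many requests are retrying at the end of each round."""
    retrying = initial_bad
    history = [retrying]
    for _ in range(rounds - 1):
        # Each request already retrying pushes more healthy requests into timeout.
        retrying += retrying * timeouts_per_retry
        history.append(retrying)
    return history


print(retrying_per_round(5))  # [1, 3, 9, 27, 81]
```

Under this assumption the retrying population triples every round, which is why the window for intervention is short.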
Clients often don’t observe the retries at all — they just see increased latency until the bad requests exhaust their retry budget and get dead-lettered. From inside the system, retry rate is a leading indicator that something is wrong upstream.
Handling retries well is a whole story by itself. Relying on clients to back off gracefully usually doesn’t work out: different clients, different SDK versions, different retry logic. Instead, our routing layer pays special attention to request age. Once a request is more than 60 seconds old, fine-grained rate limiting and request deduplication kick in to prevent a slow system from being buried by its own retry traffic.
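One way such an age-based gate could look is sketched below. Only the 60-second threshold comes from the text. The class `AgeAwareGate`, the fixed-window rate limiter, the per-window budget, and the idea of a caller-supplied deduplication key are all assumptions for illustration; the actual routing layer may work quite differently.

```python
AGE_THRESHOLD_S = 60.0  # threshold from the text


class AgeAwareGate:
    """Hypothetical admission gate: old requests are deduplicated and rate limited."""

    def __init__(self, max_old_per_window: int = 10, window_s: float = 1.0):
        self.max_old_per_window = max_old_per_window  # assumed budget
        self.window_s = window_s                      # assumed window length
        self._seen_keys: set[str] = set()             # unbounded here; a real gate would expire keys
        self._window_start = 0.0
        self._admitted_in_window = 0

    def admit(self, dedup_key: str, created_at: float, now: float) -> bool:
        if now - created_at <= AGE_THRESHOLD_S:
            return True  # young requests pass through untouched
        if dedup_key in self._seen_keys:
            return False  # an identical old request was already admitted
        if now - self._window_start >= self.window_s:
            # Start a new fixed window and reset the budget.
            self._window_start = now
            self._admitted_in_window = 0
        if self._admitted_in_window >= self.max_old_per_window:
            return False  # budget for old requests is spent in this window
        self._admitted_in_window += 1
        self._seen_keys.add(dedup_key)
        return True


gate = AgeAwareGate(max_old_per_window=2, window_s=1.0)
first = gate.admit("req-b", created_at=0.0, now=100.0)      # old, first sighting: admitted
duplicate = gate.admit("req-b", created_at=0.0, now=100.1)  # same key again: dropped
```

The design choice worth noting is that the gate lives on the server side, so it behaves the same regardless of how each client implements its retries.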
Tuning the thresholds
The lease duration and retry backoff aren’t fixed constants — they’re calibrated to the workload.
A short lease catches silent failures faster but risks re-queuing jobs that are still running, throwing away real work on long inferences. A long lease minimizes wasted work but means a crashed worker blocks capacity for longer.
The right number depends on the shape of your jobs. For models where most requests complete in seconds, a 30-second lease is conservative. For long video generation or large batch jobs that run for several minutes, you’d want a longer lease or a more frequent renewal signal.
The retry backoff follows the same logic. Exponential backoff, starting short and capping at a ceiling, dampens the self-reinforcing retry loop. But if the ceiling is too high, recovery from transient failures is slower than it needs to be.
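A capped exponential backoff is small enough to show in full. The base delay of 0.5 seconds and the 30-second ceiling are illustrative values, not figures from our configuration, and the function name is made up for this sketch.

```python
def backoff_delay(attempt: int, base_s: float = 0.5, ceiling_s: float = 30.0) -> float:
    """Delay before retry number `attempt` (1-based): doubles each time, capped."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(attempt - 1, 32)
    return min(ceiling_s, base_s * 2 ** exponent)


delays = [backoff_delay(n) for n in range(1, 9)]
print(delays)  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With these values the ceiling is reached on the seventh attempt. Many systems also add random jitter to each delay so that retries from different requests do not line up; that is left out here to keep the sketch deterministic.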
Getting these numbers wrong in either direction shows up in the metrics. Lease expiry rate climbing means workers are slower than the lease allows. Dead-letter rate climbing means retries are being exhausted without recovery.
What you see as a caller
For transient failures, nothing. The job re-queues, a healthy worker picks it up, and inference completes. The latency may be slightly higher if a retry took time to dispatch. The API response is identical to a request that succeeded on the first try.
For failures that exhaust retries, a clear status. The job is dead-lettered, the error is recorded, and the status endpoint returns it with enough context to understand what happened.
The goal is that infrastructure noise is invisible. Most of the time it is. When it isn’t, the failure surfaces as a deterministic, debuggable status — not a silent hang or a mystery timeout.
That’s the contract we’re trying to hold.