# One Request at a Time: Why We Serialize GPU Inference

*How a simple queue prevents Metal GPU crashes, and why throughput isn't always the goal.*
## The Crash
We're running Qwen3-8B locally via mlx_lm.server on Apple Silicon. The supervision pipeline makes multiple LLM calls per evaluation cycle: RAG context retrieval, rule evaluation batches, summaries. What happens when two of those calls hit the GPU at once?
SIGABRT. The Metal command buffer can't handle concurrent inference requests. This isn't a bug we can fix — it's a fundamental constraint of running large models on consumer GPU hardware.
## The Fix
All LLM requests go through a single serialization queue inside LLMClient. Only one inference runs at a time. New requests wait in line.
```typescript
class LLMClient {
  private queue: Array<{
    resolve: (value: LLMResponse) => void;
    reject: (reason: unknown) => void;
    messages: ChatMessage[];
    options?: ChatOptions;
  }> = [];
  private processing = false;

  async chat(messages: ChatMessage[], options?: ChatOptions): Promise<LLMResponse> {
    return new Promise<LLMResponse>((resolve, reject) => {
      this.queue.push({ resolve, reject, messages, options });
      void this.processQueue(); // Kick the drain loop; a no-op if one is already running.
    });
  }

  private async processQueue() {
    if (this.processing) return; // A drain loop already owns the queue.
    this.processing = true;
    while (this.queue.length > 0) {
      const req = this.queue.shift()!;
      try {
        // At most one inference is ever in flight past this point.
        const result = await this.doInference(req.messages, req.options);
        req.resolve(result);
      } catch (err) {
        req.reject(err);
      }
    }
    this.processing = false;
  }

  // doInference(...) issues the actual request to mlx_lm.server.
}
```

## The Tradeoff
Evaluation cycles take 20-40 seconds because calls are sequential. With parallelism, we could cut that to 15 seconds. But crashes are catastrophic — the LLM server dies, pending evaluations are lost, and restart takes 60+ seconds for model reloading.
| Approach | Cycle Time | Crash Risk | Recovery |
|---|---|---|---|
| Parallel (2 concurrent) | ~15s | High — SIGABRT | 60s restart |
| Serial queue | ~30s | Zero | N/A |
| Semaphore(2) | ~18s | Medium | 60s restart |
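For concreteness, the Semaphore(2) row corresponds to something like the sketch below: a hypothetical counting semaphore (not code from our client) that caps in-flight inferences at two instead of one. We rejected it because two concurrent Metal command buffers still SIGABRT.

```typescript
// Hypothetical counting semaphore — the Semaphore(2) row in the table.
// Caps concurrency at `max` in-flight tasks instead of strictly one.
class Semaphore {
  private waiters: Array<() => void> = [];
  private available: number;

  constructor(max: number) {
    this.available = max;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No slot free: park until a release hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // Hand the slot directly to the next waiter.
    } else {
      this.available++;
    }
  }
}
```

With `new Semaphore(2)`, two inferences can overlap, which is exactly the overlap Metal can't tolerate; the serial queue is the degenerate `Semaphore(1)` case.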
## The Principle
Throughput isn't always the goal. When your compute resource is a single consumer GPU, stability beats speed. Non-evaluation turns cost <1ms (just buffering), so the 30-second eval cycle only affects 1 in every 5 turns. Users barely notice.
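The whole approach reduces to a pattern you can verify in a few lines. Below is a stripped-down serial queue with the inference call stubbed out; the `SerialQueue` and `fakeInference` names and the timings are illustrative, not our real client.

```typescript
// Stripped-down version of the same pattern: callers enqueue work,
// a single drain loop runs it strictly one task at a time.
class SerialQueue {
  private queue: Array<{
    task: () => Promise<unknown>;
    resolve: (value: any) => void;
    reject: (reason: unknown) => void;
  }> = [];
  private processing = false;

  run<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      this.queue.push({ task, resolve, reject });
      void this.drain();
    });
  }

  private async drain(): Promise<void> {
    if (this.processing) return; // A drain loop is already running.
    this.processing = true;
    while (this.queue.length > 0) {
      const req = this.queue.shift()!;
      try {
        req.resolve(await req.task());
      } catch (err) {
        req.reject(err);
      }
    }
    this.processing = false;
  }
}

// Usage: three "inferences" fired concurrently still run one at a time.
const q = new SerialQueue();
let active = 0;
let maxActive = 0;
const fakeInference = async (id: number): Promise<number> => {
  active++;
  maxActive = Math.max(maxActive, active);
  await new Promise((r) => setTimeout(r, 5)); // stand-in for GPU work
  active--;
  return id;
};
const results = await Promise.all(
  [1, 2, 3].map((id) => q.run(() => fakeInference(id)))
);
// results is [1, 2, 3]; maxActive stays 1.
```

Callers never see the queue: `run` returns an ordinary promise, so the serialization is invisible except as latency.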