Technique · 5 min · 2026-03-04

One Request at a Time: Why We Serialize GPU Inference

How a simple queue prevents Metal GPU crashes and why throughput isn't always the goal.

The Crash

We're running Qwen3-8B locally via mlx_lm.server on Apple Silicon. The supervision pipeline makes multiple LLM calls per evaluation cycle — RAG context retrieval, rule evaluation batches, summaries. What happens when two requests hit the GPU at once?

crash log
mlx_lm.server: processing request 1 (rule evaluation)...
mlx_lm.server: processing request 2 (RAG retrieval)...
 
libc++abi: terminating due to uncaught exception
SIGABRT in mlx::core::metal::CommandEncoder
*** Metal command buffer execution error ***
Process terminated.

SIGABRT. The Metal command buffer can't handle concurrent inference requests. This isn't a bug we can fix — it's a fundamental constraint of running large models on consumer GPU hardware.

The Fix

All LLM requests go through a single serialization queue inside LLMClient. Only one inference runs at a time. New requests wait in line.

packages/llm/src/client.ts
private queue: Array<{
  resolve: (value: LLMResponse) => void;
  reject: (reason: unknown) => void;
  messages: ChatMessage[];
  options?: ChatOptions;
}> = [];
private processing = false;

async chat(messages: ChatMessage[], options?: ChatOptions) {
  // Every caller gets a promise immediately; the inference itself
  // happens later, in arrival order, inside the drain loop.
  return new Promise<LLMResponse>((resolve, reject) => {
    this.queue.push({ resolve, reject, messages, options });
    void this.processQueue(); // fire-and-forget: the loop settles each promise
  });
}

private async processQueue() {
  if (this.processing) return; // a drain loop is already running
  this.processing = true;
  while (this.queue.length > 0) {
    const req = this.queue.shift()!;
    try {
      // This await is the serialization point: at most one
      // inference is ever in flight against the GPU.
      const result = await this.doInference(req.messages, req.options);
      req.resolve(result);
    } catch (err) {
      req.reject(err);
    }
  }
  this.processing = false;
}

serialized execution
→ Request 1 (RAG retrieval) queued
→ Request 2 (rule eval batch 1) queued
→ Request 3 (rule eval batch 2) queued
 
● Processing request 1... done (3.2s)
● Processing request 2... done (14.8s)
● Processing request 3... done (12.1s)
 
Total: 30.1s | Zero crashes.
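The queue semantics are easy to exercise in isolation. Here is a minimal, self-contained sketch of the same pattern with a stubbed inference call — `MockClient`, the `ok:` result strings, and the 10ms delay are illustrative stand-ins, not our real `LLMClient`. The `maxActive` counter verifies that peak concurrency never exceeds one even when callers fire simultaneously:

```typescript
type Job = { resolve: (v: string) => void; reject: (e: unknown) => void; prompt: string };

class MockClient {
  private queue: Job[] = [];
  private processing = false;
  public active = 0;    // inferences in flight right now
  public maxActive = 0; // peak concurrency observed — should stay at 1

  chat(prompt: string): Promise<string> {
    return new Promise<string>((resolve, reject) => {
      this.queue.push({ resolve, reject, prompt });
      void this.processQueue();
    });
  }

  private async processQueue(): Promise<void> {
    if (this.processing) return; // another loop is already draining
    this.processing = true;
    while (this.queue.length > 0) {
      const job = this.queue.shift()!;
      try {
        job.resolve(await this.doInference(job.prompt));
      } catch (err) {
        job.reject(err);
      }
    }
    this.processing = false;
  }

  private async doInference(prompt: string): Promise<string> {
    this.active++;
    this.maxActive = Math.max(this.maxActive, this.active);
    await new Promise<void>((r) => setTimeout(r, 10)); // stand-in for a slow GPU call
    this.active--;
    return `ok:${prompt}`;
  }
}

async function main() {
  const client = new MockClient();
  // Three "concurrent" requests — the queue drains them one at a time,
  // and results resolve in arrival order.
  const results = await Promise.all([client.chat("a"), client.chat("b"), client.chat("c")]);
  console.log(results.join(","), "maxActive:", client.maxActive);
  // → ok:a,ok:b,ok:c maxActive: 1
}
main();
```

Note that `processQueue` never overlaps with itself: the `processing` flag is checked and set synchronously before the first `await`, so a second caller always returns early instead of starting a competing drain loop.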

The Tradeoff

Evaluation cycles take 20-40 seconds because calls are sequential. With parallelism, we could cut that to 15 seconds. But crashes are catastrophic — the LLM server dies, pending evaluations are lost, and restart takes 60+ seconds for model reloading.

Approach                | Cycle Time | Crash Risk      | Recovery
Parallel (2 concurrent) | ~15s       | High (SIGABRT)  | 60s restart
Serial queue            | ~30s       | Zero            | N/A
Semaphore(2)            | ~18s       | Medium          | 60s restart
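For reference, the Semaphore(2) row means capping concurrency at two instead of eliminating it. A minimal sketch of what that would look like — this `Semaphore` class and `withPermit` helper are hypothetical, not code from our tree, and on our hardware even two concurrent inferences still risked the Metal SIGABRT:

```typescript
// Counting semaphore: at most `permits` callers hold a permit at once.
class Semaphore {
  private waiters: Array<() => void> = [];
  private available: number;

  constructor(permits: number) {
    this.available = permits;
  }

  async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    // No permit free: park until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next(); // pass the permit directly to the oldest waiter
    else this.available++;
  }
}

// Run fn while holding a permit, releasing it even on failure.
async function withPermit<T>(sem: Semaphore, fn: () => Promise<T>): Promise<T> {
  await sem.acquire();
  try {
    return await fn();
  } finally {
    sem.release();
  }
}
```

The serial queue is just the degenerate Semaphore(1) with ordering guarantees — and on a single consumer GPU, that degenerate case is the only one with zero crash risk.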

The principle

Throughput isn't always the goal. When your compute resource is a single consumer GPU, stability beats speed. Non-evaluation turns cost <1ms (just buffering), so the 30-second eval cycle only affects 1 in every 5 turns. Users barely notice.