Technique · 6 min · 2026-03-04

No JSON Mode? No Problem.

How we get reliable structured output from local models without constrained decoding — and why the fallback sampler is the real hero.

The Problem with json_object Mode

Most LLM APIs offer response_format: json_object — constrained decoding that only emits valid JSON tokens. Sounds perfect. Except on local models like Qwen, it causes severe slowdowns or complete hangs. The model fights its own probability distribution to satisfy token-level constraints.

with constrained decoding
→ POST /v1/chat/completions (json_object mode)
Generating tokens: 1... 2... 3...
...
[120 seconds elapsed — no response]
Connection timeout.

The Three-Tier Retry Strategy

Instead of constraining the model, we trust it. The system prompt says "respond with JSON only," and we parse the result. When parsing fails, a carefully designed retry strategy recovers.

packages/llm/src/client.ts
async chatJSON<T>(messages: ChatMessage[], options?: ChatOptions): Promise<T> {
  // Tier 1: Normal attempt
  const response = await this.chat(messages, options);
  const cleaned = this.cleanOutput(response.content);
  try { return JSON.parse(cleaned); } catch {}

  // Tier 2: Retry (handles transient formatting glitches)
  const retry = await this.chat(messages, options);
  const cleanedRetry = this.cleanOutput(retry.content);
  try { return JSON.parse(cleanedRetry); } catch {}

  // Tier 3: Fallback sampler — nudge temperature to escape loops
  const currentTemp = options?.temperature ?? 0.7;
  const nudged = { ...options, temperature: Math.min(1.0, currentTemp + 0.1) };
  const fallback = await this.chat(messages, nudged);
  return JSON.parse(this.cleanOutput(fallback.content));
}
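The control flow above can be exercised end to end without a real model. The sketch below is illustrative, not the real client API: `mockChat` and `parseWithRetries` are hypothetical stand-ins that fake a model stuck at temperature 0.6, recovering only after the nudge.

```typescript
// Hypothetical mock of the three-tier flow (not the actual client).
type ChatOptions = { temperature?: number };

async function mockChat(opts: ChatOptions): Promise<string> {
  // Degenerate loop: at temp <= 0.6 the mock always emits the same
  // truncated, unparseable JSON, mirroring the failure mode above.
  if ((opts.temperature ?? 0.7) <= 0.6) return '{"verdict": "viol';
  return '{"verdict": "violation", "confidence": 0.85}';
}

async function parseWithRetries<T>(opts: ChatOptions = {}): Promise<T> {
  // Tiers 1 and 2: two plain attempts at the caller's temperature.
  for (let i = 0; i < 2; i++) {
    try { return JSON.parse(await mockChat(opts)) as T; } catch {}
  }
  // Tier 3: nudge temperature up by 0.1 (capped at 1.0) and try once more.
  const temp = opts.temperature ?? 0.7;
  return JSON.parse(
    await mockChat({ ...opts, temperature: Math.min(1.0, temp + 0.1) }),
  ) as T;
}
```

With `{ temperature: 0.6 }`, the first two attempts fail to parse and the third, nudged to 0.7, resolves to the full object.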

Why the Temperature Nudge Works

If the model fails twice at temperature 0.6, it's likely stuck in a degenerate loop — producing the same malformed token sequence each time. Bumping to 0.7 slightly flattens the probability distribution, giving it just enough randomness to pick a different path.

the nudge in action
Attempt 1 (temp=0.6):
{"ruleId": "debug-loop", "verdict": "viol ← truncated
 
Attempt 2 (temp=0.6):
{"ruleId": "debug-loop", "verdict": "viol ← same breakpoint
 
Attempt 3 (temp=0.7 — nudged +0.1):
{"ruleId": "debug-loop", "verdict": "violation", "confidence": 0.85}
✓ Parsed successfully
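The flattening effect can be sketched numerically. This is a toy softmax over three made-up logits, not the model's actual sampler; it only shows why dividing logits by a larger temperature shrinks the top token's lead.

```typescript
// Toy illustration: softmax over candidate-token logits at two temperatures.
function softmax(logits: number[], temperature: number): number[] {
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [4.0, 2.0, 1.0]; // fabricated logits for three candidates
const atLowTemp = softmax(logits, 0.6);
const atHighTemp = softmax(logits, 0.7);
// atHighTemp[0] < atLowTemp[0]: the dominant token loses a little mass,
// so an alternative continuation becomes reachable.
```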

The Cleaning Pipeline

Before JSON parsing, every response runs through three cleanup steps that handle known model quirks:

packages/llm/src/client.ts
private cleanOutput(content: string): string {
  return content
    // Qwen3 leaks reasoning tokens
    .replace(/<think>[\s\S]*?<\/think>\s*/g, '')
    // Qwen3 leaks end-of-turn tokens
    .replace(/<\|im_end\|>/g, '')
    // Models wrap JSON in markdown fences
    .replace(/^```(?:json)?\s*\n?([\s\S]*?)\n?\s*```$/g, '$1')
    .trim();
}
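The three steps compose, so a response exhibiting every quirk at once still comes out as bare JSON. Here is the same pipeline as a free function, run on a fabricated worst-case string (the input is invented for illustration):

```typescript
// Same cleanup steps as above, extracted so they can be exercised directly.
function cleanOutput(content: string): string {
  return content
    .replace(/<think>[\s\S]*?<\/think>\s*/g, '') // leaked reasoning tokens
    .replace(/<\|im_end\|>/g, '') // leaked end-of-turn tokens
    .replace(/^```(?:json)?\s*\n?([\s\S]*?)\n?\s*```$/g, '$1') // md fences
    .trim();
}

// Fabricated worst case: reasoning tokens, a markdown fence, and an
// end-of-turn token in one response.
const raw =
  '<think>checking the rule...</think>```json\n{"ok": true}\n```<|im_end|>';
const cleaned = cleanOutput(raw);
// cleaned === '{"ok": true}', now safe to JSON.parse
```

Note the order matters: the end-of-turn token must be stripped before the fence regex runs, because that regex is anchored to the end of the string.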

The principle

Cloud APIs (OpenAI, Anthropic) handle constrained decoding well because their infrastructure is optimized for it. Local models on consumer hardware? You're better off trusting the model and building resilient parsing around it. The cleaning pipeline handles 95% of cases. The retry handles 4.9%. The temperature nudge handles the last 0.1%.