No JSON Mode? No Problem.
How we get reliable structured output from local models without constrained decoding — and why the fallback sampler is the real hero.
The Problem with json_object Mode
Most LLM APIs offer response_format: json_object — constrained decoding that only emits valid JSON tokens. Sounds perfect. Except on local models like Qwen, it causes severe slowdowns or complete hangs. The model fights its own probability distribution to satisfy token-level constraints.
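For reference, constrained mode is a single field on an OpenAI-compatible request. Here's a minimal sketch of what we're turning off; the endpoint URL and model name are placeholders, but response_format is the real OpenAI-style field:

// Placeholder endpoint and model; response_format is what triggers
// constrained decoding. The server masks every token that could break
// JSON validity, and that per-token grammar check is what stalls local backends.
const res = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'qwen3',
    messages: [{ role: 'user', content: 'List three colors as JSON.' }],
    response_format: { type: 'json_object' },
  }),
});
const data = await res.json();

Deleting that one field and relying on the system prompt instead is the whole change; everything that follows is recovery logic.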
The Three-Tier Retry Strategy
Instead of constraining the model, we trust it. The system prompt says "respond with JSON only," and we parse the result. When parsing fails, a three-tier retry strategy recovers a usable result.
async chatJSON<T>(messages: ChatMessage[], options?: ChatOptions): Promise<T> {
  // Tier 1: Normal attempt
  const response = await this.chat(messages, options);
  const cleaned = this.cleanOutput(response.content);
  try { return JSON.parse(cleaned); } catch {}

  // Tier 2: Retry (handles transient formatting glitches)
  const retry = await this.chat(messages, options);
  const cleanedRetry = this.cleanOutput(retry.content);
  try { return JSON.parse(cleanedRetry); } catch {}

  // Tier 3: Fallback sampler — nudge temperature to escape loops
  const currentTemp = options?.temperature ?? 0.7;
  const nudged = { ...options, temperature: Math.min(1.0, currentTemp + 0.1) };
  const fallback = await this.chat(messages, nudged);
  return JSON.parse(this.cleanOutput(fallback.content));
}
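A hypothetical call site; the llm instance, message shape, and schema here are illustrative, not from the original code:

// Hypothetical usage: ask for a typed result and let the tiers handle recovery.
interface PostSummary { title: string; tags: string[]; }

const summary = await llm.chatJSON<PostSummary>([
  { role: 'system', content: 'Respond with JSON only: {"title": string, "tags": string[]}' },
  { role: 'user', content: 'Summarize this post into a title and tags.' },
]);
console.log(summary.tags); // typed as string[]

Why the Temperature Nudge Works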
If the model fails twice at temperature 0.6, it's likely stuck in a degenerate loop — producing the same malformed token sequence each time. Bumping to 0.7 slightly flattens the probability distribution, giving it just enough randomness to pick a different path.
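To see why, here's a toy softmax with temperature; the logits are invented for illustration:

// p_i = exp(logit_i / T) / sum_j exp(logit_j / T)
// Raising T flattens the distribution, shaving probability off the top token.
function softmaxWithTemperature(logits: number[], T: number): number[] {
  const scaled = logits.map((l) => l / T);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [4.0, 3.0, 2.0]; // pretend the top token starts the malformed sequence
console.log(softmaxWithTemperature(logits, 0.6)); // ≈ [0.82, 0.15, 0.03]
console.log(softmaxWithTemperature(logits, 0.7)); // ≈ [0.77, 0.18, 0.04]

The offending token loses roughly five points of probability mass. That's tiny per token, but compounded over a few hundred tokens it's usually enough to knock the model off the exact sequence it produced twice before.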
The Cleaning Pipeline
Before JSON parsing, every response runs through three cleanup steps that handle known model quirks:
private cleanOutput(content: string): string {
  return content
    // Qwen3 leaks reasoning tokens
    .replace(/<think>[\s\S]*?<\/think>\s*/g, '')
    // Qwen3 leaks end-of-turn tokens
    .replace(/<\|im_end\|>/g, '')
    // Trim first so the anchored fence pattern below can match
    .trim()
    // Models wrap JSON in markdown fences
    .replace(/^```(?:json)?\s*\n?([\s\S]*?)\n?\s*```$/, '$1')
    .trim();
}
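End to end, here's what the pipeline recovers from a typical messy local reply; the raw string is a made-up example combining all three quirks:

// Hypothetical raw reply exhibiting every quirk at once.
const raw =
  '<think>The user wants JSON, so...</think>\n' +
  '```json\n{"status": "ok", "items": 3}\n```<|im_end|>';

// cleanOutput(raw) -> '{"status": "ok", "items": 3}', which JSON.parse accepts.

The Principle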
Cloud APIs (OpenAI, Anthropic) handle constrained decoding well because their infrastructure is optimized for it. Local models on consumer hardware? You're better off trusting the model and building resilient parsing. The cleaning pipeline handles 95% of cases. The retry handles 4.9%. The temperature nudge handles the last 0.1%.