Technique · 5 min read · 2026-03-04

Think When It Matters: Budgeting Chain-of-Thought

Why only one task type in our pipeline gets reasoning tokens — and how /no_think cuts inference time by 80%.

The Cost of Thinking

Qwen3 has a built-in chain-of-thought mode. When it "thinks," it emits reasoning tokens inside <think> tags before the actual answer. This improves accuracy on complex tasks — but it costs tokens, time, and GPU cycles.

with /think (rule evaluation)
<think>
The user wants me to evaluate if this is a debug loop.
pnpm test failed 4 times in the recent window.
The threshold is 3+ failures = violation.
But I should check — did the user explicitly ask to retry?
Looking at the context... no user instruction to retry.
The agent is editing auth.ts between retries but not the
test setup, which is where the actual failure is.
This is a genuine debug loop.
</think>
{"verdict": "violation", "confidence": 0.85}
 
~800 tokens | ~15 seconds
with /no_think (RAG classification)
{"action": "new_topic", "title": "Test Debugging"}
 
~30 tokens | ~3 seconds
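In code, the soft switch is just text: append `/think` or `/no_think` to the user turn, and strip the `<think>` block from replies before parsing them. A minimal sketch (the helper names and the client wiring around them are assumptions, not from the source):

```python
import re

# Matches a reasoning block and any trailing whitespace after it.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def tag_prompt(prompt: str, think: bool) -> str:
    """Append Qwen3's soft switch (/think or /no_think) to the user turn."""
    return f"{prompt} {'/think' if think else '/no_think'}"

def strip_reasoning(reply: str) -> str:
    """Drop the <think>...</think> block, keeping only the final answer."""
    return THINK_RE.sub("", reply).strip()

reply = '<think>\nweighing signals...\n</think>\n{"verdict": "violation"}'
print(strip_reasoning(reply))  # {"verdict": "violation"}
```

Stripping the block before JSON parsing matters: a thinking reply is not valid JSON until the reasoning prefix is gone.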

The Task Taxonomy

Every LLM call in the pipeline is categorized. Only one task type gets chain-of-thought:

| Task | Mode | Temperature | Why |
|---|---|---|---|
| Rule evaluation | /think | 0.6 | Must weigh competing signals and exceptions |
| RAG index building | /no_think | 0.7 | Simple 3-way classification |
| RAG context retrieval | /no_think | 0.7 | Pick relevant indices from a list |
| Plan change detection | /no_think | 0.7 | Binary yes/no decision |
| Explainer generation | /no_think | 0.7 | Creative writing, not reasoning |
| Summary generation | /no_think | 0.7 | Condensation, not analysis |
| Title generation | /no_think | 0.3 | Short factual extraction |
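A taxonomy like this is easiest to enforce when it lives in one config table that every call site consults, so no task accidentally inherits thinking mode. A sketch with hypothetical task keys mirroring the table above:

```python
from typing import NamedTuple

class TaskConfig(NamedTuple):
    think: bool
    temperature: float

# Mode and temperature per task type (keys are illustrative names).
TASKS: dict[str, TaskConfig] = {
    "rule_evaluation": TaskConfig(think=True,  temperature=0.6),
    "rag_index":       TaskConfig(think=False, temperature=0.7),
    "rag_retrieval":   TaskConfig(think=False, temperature=0.7),
    "plan_change":     TaskConfig(think=False, temperature=0.7),
    "explainer":       TaskConfig(think=False, temperature=0.7),
    "summary":         TaskConfig(think=False, temperature=0.7),
    "title":           TaskConfig(think=False, temperature=0.3),
}

def config_for(task: str) -> TaskConfig:
    # KeyError on an unknown task: fail loudly rather than guess a mode.
    return TASKS[task]
```

Centralizing the mapping also makes the budget auditable: one line of code can verify that exactly one task type pays for reasoning tokens.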

Why Rule Evaluation Needs Reasoning

Rule evaluation is the only task where the LLM must weigh competing signals. Consider this scenario:

a nuanced case
Data: pnpm test failed 4x in recent window
Rule: debug-loop threshold is 3+ failures
 
But also:
- User said "keep trying, I think it's a flaky test"
- Each retry has a different error message
- Agent changed strategy after retry 3
 
Without /think: "violation" (confidence: 0.80) ← wrong
With /think: "pass" (confidence: 0.75) ← correct

The reasoning step lets the model consider exceptions and context before making a judgment. Without it, the model just pattern-matches numbers against thresholds — which code could do without an LLM at all.
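That pattern-matching failure mode is easy to reproduce in plain code: a threshold check sees only the failure count, never the exceptions. A hypothetical sketch of what the rule reduces to without reasoning:

```python
def naive_debug_loop_check(failures: int, threshold: int = 3) -> str:
    """Pure threshold check: the logic a no-thinking call effectively collapses to."""
    return "violation" if failures >= threshold else "pass"

# 4 failures trips the threshold even though the user asked to keep trying,
# each retry had a different error, and the agent changed strategy after
# retry 3 -- context this check cannot see.
print(naive_debug_loop_check(4))  # violation
```

The thinking pass earns its cost precisely because it can weigh that surrounding context and overrule the raw number.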

The Performance Impact

A typical evaluation cycle makes 4-5 LLM calls. With selective thinking:

| Call | /think everywhere | Selective /think |
|---|---|---|
| RAG retrieval | 15s | 3s |
| Rule eval batch 1 | 15s | 15s |
| Rule eval batch 2 | 15s | 15s |
| Summary | 15s | 3s |
| Total | ~60s | ~36s |
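The totals fall out of the per-call latencies directly: at roughly 15s per thinking call and 3s per non-thinking call, only the two rule-eval batches stay expensive. Checking the arithmetic (call names are illustrative):

```python
THINK_S, NO_THINK_S = 15, 3  # approximate per-call latencies from above

calls = ["rag_retrieval", "rule_eval_1", "rule_eval_2", "summary"]
needs_think = {"rule_eval_1", "rule_eval_2"}

everywhere = len(calls) * THINK_S
selective = sum(THINK_S if c in needs_think else NO_THINK_S for c in calls)
print(everywhere, selective)  # 60 36
```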

The principle

Chain-of-thought isn't free. In the numbers above it costs roughly 25x the tokens and 5x the wall-clock time per call. Use it surgically, only where task complexity justifies the cost. Know your task taxonomy: classification and extraction don't need reasoning; nuanced multi-factor judgment does.