Think When It Matters: Budgeting Chain-of-Thought
Why only one task type in our pipeline gets reasoning tokens — and how /no_think cuts per-call inference time by 80%.
The Cost of Thinking
Qwen3 has a built-in chain-of-thought mode. When it "thinks," it emits reasoning tokens inside <think> tags before the actual answer. This improves accuracy on complex tasks — but it costs tokens, time, and GPU cycles.
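Because the reasoning tokens arrive inside `<think>` tags, downstream code has to strip them before using the answer. A minimal sketch (the tag format is Qwen3's; the function name is ours):

```python
import re

# Matches a <think>...</think> block plus any trailing whitespace.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_reasoning(raw_output: str) -> str:
    """Remove reasoning blocks, leaving only the final answer."""
    return THINK_RE.sub("", raw_output).strip()

raw = "<think>The user asks for 2+2. That is 4.</think>The answer is 4."
print(strip_reasoning(raw))  # The answer is 4.
```

Note that every reasoning token inside the tags was billed and generated before the answer appeared, which is exactly the cost this post is about.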
The Task Taxonomy
Every LLM call in the pipeline is categorized. Only one task type gets chain-of-thought:
| Task | Mode | Temperature | Why |
|---|---|---|---|
| Rule evaluation | /think | 0.6 | Must weigh competing signals and exceptions |
| RAG index building | /no_think | 0.7 | Simple 3-way classification |
| RAG context retrieval | /no_think | 0.7 | Pick relevant indices from a list |
| Plan change detection | /no_think | 0.7 | Binary yes/no decision |
| Explainer generation | /no_think | 0.7 | Creative writing, not reasoning |
| Summary generation | /no_think | 0.7 | Condensation, not analysis |
| Title generation | /no_think | 0.3 | Short factual extraction |
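The taxonomy above can live in a small config table keyed by task type. This is a sketch under our own naming (the task keys and `TaskConfig` type are hypothetical); the soft switch itself — appending `/think` or `/no_think` to the user prompt — is Qwen3's documented mechanism:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskConfig:
    thinking: bool
    temperature: float

# One entry per task type in the taxonomy table.
TASK_CONFIGS = {
    "rule_evaluation":       TaskConfig(thinking=True,  temperature=0.6),
    "rag_index_building":    TaskConfig(thinking=False, temperature=0.7),
    "rag_context_retrieval": TaskConfig(thinking=False, temperature=0.7),
    "plan_change_detection": TaskConfig(thinking=False, temperature=0.7),
    "explainer_generation":  TaskConfig(thinking=False, temperature=0.7),
    "summary_generation":    TaskConfig(thinking=False, temperature=0.7),
    "title_generation":      TaskConfig(thinking=False, temperature=0.3),
}

def build_prompt(task: str, user_prompt: str) -> str:
    """Append the Qwen3 soft-switch token for this task type."""
    cfg = TASK_CONFIGS[task]
    tag = "/think" if cfg.thinking else "/no_think"
    return f"{user_prompt} {tag}"
```

Centralizing the mapping means a call site only names its task type; nobody hand-picks a temperature or thinking mode per prompt.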
Why Rule Evaluation Needs Reasoning
Rule evaluation is the only task where the LLM must weigh competing signals. The reasoning step lets the model consider exceptions and context before making a judgment. Without it, the model just pattern-matches numbers against thresholds — which code could do without an LLM at all.
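To make that contrast concrete, here is the kind of pure threshold matching a non-reasoning pass collapses into, written as plain code (the rule shape is hypothetical):

```python
def evaluate_rule_naively(metrics: dict, rule: dict) -> bool:
    """Compare one number to one limit: no exceptions, no context.
    If this were all rule evaluation required, no LLM would be needed."""
    return metrics[rule["metric"]] > rule["threshold"]

rule = {"metric": "error_rate", "threshold": 0.05}
print(evaluate_rule_naively({"error_rate": 0.12}, rule))  # True
```

What the thinking mode buys is everything this function cannot express: "the error rate is high, but a deploy finished two minutes ago and the rule carves out deploy windows."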
The Performance Impact
A typical evaluation cycle makes 4-5 LLM calls. With selective thinking:
| Call | /think everywhere | Selective /think |
|---|---|---|
| RAG retrieval | 15s | 3s |
| Rule eval batch 1 | 15s | 15s |
| Rule eval batch 2 | 15s | 15s |
| Summary | 15s | 3s |
| Total | ~60s | ~36s |
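The totals are easy to sanity-check from the per-call timings in the table:

```python
# Seconds per call, taken from the table above.
think_everywhere = [15, 15, 15, 15]  # all four calls use /think
selective        = [3, 15, 15, 3]    # /no_think for retrieval and summary

total_all = sum(think_everywhere)
total_sel = sum(selective)
print(f"{total_all}s -> {total_sel}s "
      f"({1 - total_sel / total_all:.0%} faster per cycle)")  # 60s -> 36s (40% faster per cycle)
```

So the 80% saving is per /no_think call (15s down to 3s); at the cycle level, with two calls still thinking, the saving is 40%.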
The Principle
Chain-of-thought isn't free. It costs 5x the tokens and 5x the time. Use it surgically — only where the task complexity justifies the cost. Know your task taxonomy: classification and extraction don't need reasoning. Nuanced multi-factor judgment does.