Technique · 5 min read · 2026-03-04

Think When It Matters: Budgeting Chain-of-Thought

Why only one task type in our pipeline gets reasoning tokens — and how /no_think cuts inference time by 80%.

The Cost of Thinking

Qwen3 has a built-in chain-of-thought mode. When it "thinks," it emits reasoning tokens inside <think> tags before the actual answer. This improves accuracy on complex tasks — but it costs tokens, time, and GPU cycles.

with /think (rule evaluation)
<think>
The user wants me to evaluate if this is a debug loop.
pnpm test failed 4 times in the recent window.
The threshold is 3+ failures = violation.
But I should check — did the user explicitly ask to retry?
Looking at the context... no user instruction to retry.
The agent is editing auth.ts between retries but not the
test setup, which is where the actual failure is.
This is a genuine debug loop.
</think>
{"verdict": "violation", "confidence": 0.85}
 
~800 tokens | ~15 seconds
with /no_think (RAG classification)
{"action": "new_topic", "title": "Test Debugging"}
 
~30 tokens | ~3 seconds
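In code, the soft switch is just text: append `/think` or `/no_think` to the user turn, and strip the `<think>` block from replies before parsing them. A minimal sketch (the helper names and the client wiring around them are assumptions, not from the source):

```python
import re

# Matches a reasoning block and any trailing whitespace after it.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def tag_prompt(prompt: str, think: bool) -> str:
    """Append Qwen3's soft switch (/think or /no_think) to the user turn."""
    return f"{prompt} {'/think' if think else '/no_think'}"

def strip_reasoning(reply: str) -> str:
    """Drop the <think>...</think> block, keeping only the final answer."""
    return THINK_RE.sub("", reply).strip()

reply = '<think>\nweighing signals...\n</think>\n{"verdict": "violation"}'
print(strip_reasoning(reply))  # {"verdict": "violation"}
```

Stripping the block before JSON parsing matters: a thinking reply is not valid JSON until the reasoning prefix is gone.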

The Task Taxonomy

Every LLM call in the pipeline is categorized. Only one task type gets chain-of-thought:

| Task | Mode | Temperature | Why |
|---|---|---|---|
| Rule evaluation | /think | 0.6 | Must weigh competing signals and exceptions |
| RAG index building | /no_think | 0.7 | Simple 3-way classification |
| RAG context retrieval | /no_think | 0.7 | Pick relevant indices from a list |
| Plan change detection | /no_think | 0.7 | Binary yes/no decision |
| Explainer generation | /no_think | 0.7 | Creative writing, not reasoning |
| Summary generation | /no_think | 0.7 | Condensation, not analysis |
| Title generation | /no_think | 0.3 | Short factual extraction |
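A taxonomy like this is easiest to enforce when it lives in one config table that every call site consults, so no task accidentally inherits thinking mode. A sketch with hypothetical task keys mirroring the table above:

```python
from typing import NamedTuple

class TaskConfig(NamedTuple):
    think: bool
    temperature: float

# Mode and temperature per task type (keys are illustrative names).
TASKS: dict[str, TaskConfig] = {
    "rule_evaluation": TaskConfig(think=True,  temperature=0.6),
    "rag_index":       TaskConfig(think=False, temperature=0.7),
    "rag_retrieval":   TaskConfig(think=False, temperature=0.7),
    "plan_change":     TaskConfig(think=False, temperature=0.7),
    "explainer":       TaskConfig(think=False, temperature=0.7),
    "summary":         TaskConfig(think=False, temperature=0.7),
    "title":           TaskConfig(think=False, temperature=0.3),
}

def config_for(task: str) -> TaskConfig:
    # KeyError on an unknown task: fail loudly rather than guess a mode.
    return TASKS[task]
```

Centralizing the mapping also makes the budget auditable: one line of code can verify that exactly one task type pays for reasoning tokens.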

Why Rule Evaluation Needs Reasoning

Rule evaluation is the only task where the LLM must weigh competing signals. Consider this scenario:

a nuanced case
Data: pnpm test failed 4x in recent window
Rule: debug-loop threshold is 3+ failures
 
But also:
- User said "keep trying, I think it's a flaky test"
- Each retry has a different error message
- Agent changed strategy after retry 3
 
Without /think: "violation" (confidence: 0.80) ← wrong
With /think: "pass" (confidence: 0.75) ← correct

The reasoning step lets the model consider exceptions and context before making a judgment. Without it, the model just pattern-matches numbers against thresholds — which code could do without an LLM at all.
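That pattern-matching failure mode is easy to reproduce in plain code: a threshold check sees only the failure count, never the exceptions. A hypothetical sketch of what the rule reduces to without reasoning:

```python
def naive_debug_loop_check(failures: int, threshold: int = 3) -> str:
    """Pure threshold check: the logic a no-thinking call effectively collapses to."""
    return "violation" if failures >= threshold else "pass"

# 4 failures trips the threshold even though the user asked to keep trying,
# each retry had a different error, and the agent changed strategy after
# retry 3 -- context this check cannot see.
print(naive_debug_loop_check(4))  # violation
```

The thinking pass earns its cost precisely because it can weigh that surrounding context and overrule the raw number.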

The Performance Impact

A typical evaluation cycle makes 4-5 LLM calls. With selective thinking:

| Call | /think everywhere | Selective /think |
|---|---|---|
| RAG retrieval | 15s | 3s |
| Rule eval batch 1 | 15s | 15s |
| Rule eval batch 2 | 15s | 15s |
| Summary | 15s | 3s |
| Total | ~60s | ~36s |
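The totals fall out of the per-call latencies directly: at roughly 15s per thinking call and 3s per non-thinking call, only the two rule-eval batches stay expensive. Checking the arithmetic (call names are illustrative):

```python
THINK_S, NO_THINK_S = 15, 3  # approximate per-call latencies from above

calls = ["rag_retrieval", "rule_eval_1", "rule_eval_2", "summary"]
needs_think = {"rule_eval_1", "rule_eval_2"}

everywhere = len(calls) * THINK_S
selective = sum(THINK_S if c in needs_think else NO_THINK_S for c in calls)
print(everywhere, selective)  # 60 36
```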

The principle

Chain-of-thought isn't free. In the numbers above it costs roughly 25x the tokens and 5x the wall-clock time per call. Use it surgically, only where task complexity justifies the cost. Know your task taxonomy: classification and extraction don't need reasoning; nuanced multi-factor judgment does.