Technique · 5 min · 2026-03-04

When the LLM Is Wrong, Code Wins

How hard overrides on LLM decisions keep the RAG tree balanced — and the broader principle of LLM-in-the-loop, not LLM-in-control.

The Yes-Man Problem

The RAG index builder asks the LLM: "should these 4 turns extend the current topic, start a new topic, or start a new phase?" The LLM says extend_topic. Almost every time.

Why? Because consecutive turns in a coding session ARE related. The LLM sees coherence and thinks "same topic." You end up with one monolithic phase spanning 200 turns and one giant topic — useless for retrieval.

without overrides — 300-turn session
Phase 1: "Project Setup and Implementation" (turns 1-300)
└─ Topic: "Working on the project" (turns 1-300)
├─ Action: turns 1-4
├─ Action: turns 5-8
├─ ...75 more actions...
└─ Action: turns 297-300
 
↑ One phase, one topic, 75 actions. Useless for retrieval.

The Override

packages/rag/src/builder.ts
// LLM says "extend_topic" but the topic has 24 turns
if (llmDecision === 'extend_topic' && currentTopic.turns > 20) {
  decision = 'new_topic';  // Override: force a split
}

// LLM says "extend_topic" or "new_topic" but the phase has 85 turns
if (decision !== 'new_phase' && currentPhase.turns > 80) {
  decision = 'new_phase';  // Override: force a phase break
}
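Put together, the two overrides amount to a small post-processing step on the LLM's answer. A minimal sketch, assuming the decision is a three-way string union and the threshold constants are named here for illustration (they are not the actual builder API):

```typescript
type StructureDecision = 'extend_topic' | 'new_topic' | 'new_phase';

// Code-owned thresholds from the snippet above; constant names assumed.
const MAX_TOPIC_TURNS = 20;
const MAX_PHASE_TURNS = 80;

// Clamp the LLM's decision against the structural invariants.
function applyOverrides(
  llmDecision: StructureDecision,
  topicTurns: number,
  phaseTurns: number
): StructureDecision {
  let decision = llmDecision;
  // Topic too long: force a split even if the LLM wants to extend.
  if (decision === 'extend_topic' && topicTurns > MAX_TOPIC_TURNS) {
    decision = 'new_topic';
  }
  // Phase too long: force a phase break regardless of the LLM's answer.
  if (decision !== 'new_phase' && phaseTurns > MAX_PHASE_TURNS) {
    decision = 'new_phase';
  }
  return decision;
}
```

Note the ordering: the phase check runs second and can upgrade a forced new_topic into a new_phase, so a yes-man LLM that always answers extend_topic still can't grow a topic or phase past its limit.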
with overrides — same 300-turn session
Phase 1: "Authentication Implementation" (turns 1-95)
├─ Topic: "JWT Setup" (turns 1-20)
├─ Topic: "Auth Middleware" (turns 21-52)
└─ Topic: "Permission Guards" (turns 53-95)
Phase 2: "Testing & Debug" (turns 96-210)
├─ Topic: "Unit Tests" (turns 96-130)
└─ Topic: "Integration Test Debugging" (turns 131-210)
Phase 3: "Deployment" (turns 211-300)
└─ Topic: "CI/CD Pipeline Setup" (turns 211-300)
 
↑ 3 phases, 6 topics. Retrieval can pinpoint any area.

How the Thresholds Were Chosen

Empirical tuning, not theory. We ran the system against real 300+ turn sessions and observed the tree output:

Topic/Phase Limit    | Result          | Observation
topic=10, phase=50   | Too many splits | "Writing auth.ts" and "Still writing auth.ts" as separate topics
topic=30, phase=100  | Too few splits  | Single topic covers write + test + debug + deploy
topic=20, phase=80   | Balanced        | 3-5 phases, 3-8 topics per phase, good retrieval

Graceful Degradation

The same thresholds serve as the fallback when the LLM fails entirely:

packages/rag/src/builder.ts
// LLM failed twice — use heuristic fallback
if (phases.length === 0) return 'new_phase';       // No structure yet
if (currentPhase.turns > 80) return 'new_phase';   // Phase too long
if (currentTopic.turns > 20) return 'new_topic';   // Topic too long
return 'extend_topic';                              // Default: continue

LLM failure just means you lose custom titles and summaries — you keep correct tree structure. The structural invariants hold whether the LLM cooperates or not.
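The retry-then-fallback flow around the LLM call can be sketched like this. The retry count of two comes from the comment in the snippet above; the state shape and function names are assumptions, not the actual builder code:

```typescript
type StructureDecision = 'extend_topic' | 'new_topic' | 'new_phase';

interface TreeState {
  phaseCount: number; // phases built so far
  phaseTurns: number; // turns in the current phase
  topicTurns: number; // turns in the current topic
}

// Pure threshold heuristic, mirroring the fallback above.
function heuristicDecision(state: TreeState): StructureDecision {
  if (state.phaseCount === 0) return 'new_phase'; // no structure yet
  if (state.phaseTurns > 80) return 'new_phase';  // phase too long
  if (state.topicTurns > 20) return 'new_topic';  // topic too long
  return 'extend_topic';                          // default: continue
}

// Ask the LLM up to twice; if both attempts fail, fall back to the heuristic.
async function decide(
  askLlm: () => Promise<StructureDecision>,
  state: TreeState
): Promise<StructureDecision> {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      return await askLlm();
    } catch {
      // swallow the error and retry; after the loop, use the heuristic
    }
  }
  return heuristicDecision(state);
}
```

Because the heuristic uses the same thresholds as the overrides, the tree the fallback produces is structurally identical to the tree a cooperative LLM would produce, minus the titles and summaries.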

The Trust Boundary

Code Owns                               | LLM Owns
When to split (thresholds)              | What to name topics/phases
Tree structure (Phase > Topic > Action) | Summary content
Batch size (4 turns)                    | Relevance judgment
Evaluation cadence                      | Violation vs. pass decisions
Max explanations (100/session)          | Explanation text
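One way to make this boundary concrete is in the types: everything code owns is constants and control flow, everything the LLM owns is free text that can be absent. A sketch, with all field names assumed:

```typescript
// Code-owned: structural invariants, fixed at build time.
const STRUCTURE = {
  maxTopicTurns: 20,
  maxPhaseTurns: 80,
  batchSize: 4,
  maxExplanationsPerSession: 100,
} as const;

// LLM-owned: free-text enrichment. Losing these degrades quality,
// never structure.
interface LlmEnrichment {
  title: string;   // what to name the topic/phase
  summary: string; // summary content
}

// A tree node keeps the two concerns separate: turn ranges are always
// present; enrichment is optional and fails soft.
interface TopicNode {
  startTurn: number;
  endTurn: number;
  enrichment?: LlmEnrichment; // absent when the LLM failed
}
```

Making enrichment optional at the type level means no code path can depend on the LLM having answered, which is exactly the graceful-degradation property described above.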

The principle

LLM-in-the-loop, not LLM-in-control. Code defines structural invariants — the LLM adds richness. If you swapped Qwen3 for a completely different model, all structural guarantees would still hold. Only the judgment quality and prompt phrasing would need re-tuning. That's the boundary.