Technique · 6 min · 2026-03-04

Code Counts, LLM Judges

Why we never let the LLM count to three — and how splitting determinism from intelligence makes the system testable.

LLMs Can't Count

Ask an LLM "how many times did pnpm test appear in these 30 turns?" and you'll get a wrong answer. It might say 3 when the real number is 7. It might miss failures that happened 20 turns ago. LLMs process text — they don't maintain running tallies.

So we split the work: code handles everything that can be counted, the LLM handles everything that requires judgment.

What Code Computes

buildStructuredTurnData() output
── Deterministic Metrics ─────────────────────────
 
Top commands (full session):
pnpm test — 12x
pnpm build — 4x
git add — 3x
 
Top commands (recent 30 turns):
pnpm test — 8x (4 failures)
git diff — 2x
 
File edits:
src/auth.ts — 6x
src/jwt.ts — 3x
 
Progress rate: 0.72 (full) | 0.23 (recent)
Command diversity: 0.3 (unique/total in window)
User corrections: 2 ("wrong", "try again")

What the LLM Judges

The LLM gets these numbers alongside the rule definitions (with explicit thresholds) and the RAG context. It makes a qualitative call:

LLM evaluation
── LLM Input ────────────────────────────────────
Rule: debug-loop
Threshold: command failing 3+ times = violation
Data: pnpm test failed 4x in recent 30 turns
Context: Agent is editing auth.ts between each retry
 
── LLM Output ───────────────────────────────────
verdict: "violation"
confidence: 0.85
reason: "Test has failed 4 times with the same
assertion error. Agent is editing the
implementation but not the test setup."
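Because the verdict is structured, code can validate it before anything downstream trusts it. A sketch, assuming the LLM is prompted to reply in JSON; the Verdict type and parseVerdict helper are illustrative, not the actual implementation:

```typescript
// Illustrative verdict shape, mirroring the fields shown above.
interface Verdict {
  verdict: "violation" | "ok";
  confidence: number; // 0..1
  reason: string;
}

// Validate the LLM's reply before acting on it. Malformed output
// is treated as "no verdict", never silently coerced into a violation.
function parseVerdict(raw: string): Verdict | null {
  try {
    const v = JSON.parse(raw);
    if ((v.verdict === "violation" || v.verdict === "ok")
        && typeof v.confidence === "number"
        && v.confidence >= 0 && v.confidence <= 1
        && typeof v.reason === "string") {
      return v as Verdict;
    }
  } catch {
    // fall through: unparseable JSON
  }
  return null;
}
```

Keeping validation on the code side of the split means a flaky or truncated LLM response degrades to a non-answer instead of a bogus alert.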

The Progress Metric

One of the most important computed metrics is progress rate — the fraction of assistant turns that produced "meaningful activity." But what counts as meaningful?

packages/supervisor/src/supervisor.ts
const investigationTools = ['Read', 'Grep', 'Glob', 'Bash', 'Edit', 'Write'];

// A turn has progress if it ran commands, edited files, or used investigation tools
const hasProgress = turn.commands.length > 0
  || turn.filesChanged.length > 0
  || turn.toolsUsed.some(t => investigationTools.includes(t));

// Text-only turns do NOT count as progress
// This prevents the LLM from counting verbose explanations as "work"

The two-window approach (full session + recent 30 turns) catches both chronic and acute problems. An agent that was productive for 120 turns then stalls for 30 has a healthy full-session rate (0.70) but a terrible recent rate (0.10). Without the recent window, you'd miss the stall entirely.
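That scenario is easy to reproduce in code. A sketch of the two-window computation using the 0.70/0.10 numbers from the stall example; progressRate and TurnProgress are hypothetical names:

```typescript
// Each turn reduced to the single bit the metric needs.
interface TurnProgress { hasProgress: boolean }

// Fraction of turns with meaningful activity.
function progressRate(turns: TurnProgress[]): number {
  if (turns.length === 0) return 0;
  return turns.filter(t => t.hasProgress).length / turns.length;
}

// 120 mostly-productive turns (102 with progress), then a 30-turn
// stall with only 3 productive turns.
const session: TurnProgress[] = [
  ...Array.from({ length: 102 }, () => ({ hasProgress: true })),
  ...Array.from({ length: 18 }, () => ({ hasProgress: false })),
  ...Array.from({ length: 3 }, () => ({ hasProgress: true })),
  ...Array.from({ length: 27 }, () => ({ hasProgress: false })),
];

const fullRate = progressRate(session);              // 0.70: looks healthy
const recentRate = progressRate(session.slice(-30)); // 0.10: the stall shows
```

The full-session rate alone would pass any reasonable threshold; only the recent window exposes the problem.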

Why This Makes the System Testable

The separation creates two independent test surfaces:

Layer | Test Method | LLM Required?
Metric computation | Unit tests with fixed Turn data | No
LLM judgment | Eval suite with fixed metrics + gold-standard verdicts | Yes
End-to-end | Full session replay with expected alert outcomes | Yes

The principle

If you can write a deterministic function for it, don't send it to the LLM. Reserve the LLM for genuine judgment calls where context and nuance matter. Your system will be faster, cheaper, more testable, and more reliable.