Code Counts, LLM Judges
Why we never let the LLM count to three — and how splitting determinism from intelligence makes the system testable.
LLMs Can't Count
Ask an LLM "how many times did `pnpm test` appear in these 30 turns" and you'll get a wrong answer. It might say 3 when the real number is 7. It might miss failures that happened 20 turns ago. LLMs process text — they don't maintain running tallies.
So we split the work: code handles everything that can be counted, the LLM handles everything that requires judgment.
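The deterministic half of that split fits in a few lines. A minimal sketch, assuming a hypothetical `Turn` shape with a `commands` array; `countCommand` is an illustrative helper, not the system's actual API:

```typescript
// Hypothetical Turn shape, for illustration only.
interface Turn {
  commands: string[];
}

// Deterministic side of the split: code produces the exact tally
// that an LLM asked to "count occurrences" would misreport.
function countCommand(turns: Turn[], command: string): number {
  return turns.reduce(
    (total, turn) =>
      total + turn.commands.filter((c) => c.startsWith(command)).length,
    0,
  );
}
```

The count is exact regardless of how many turns back an occurrence happened, which is precisely where the LLM fails.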
What Code Computes
Everything with a single numeric answer: command tallies, file-change counts, test-failure counts, and progress rates over both the full session and the recent window.
What the LLM Judges
The LLM gets these numbers alongside the rule definitions (with explicit thresholds) and the RAG context. It makes the qualitative call code can't: given these numbers, is the agent genuinely stuck, or still working through a hard problem?
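Concretely, the handoff might look like the following sketch. The field names here are invented for illustration; the article doesn't show its actual payload shape:

```typescript
// Hypothetical shape of what the LLM receives. It never computes
// these numbers -- it only interprets them against the rules.
interface JudgmentInput {
  metrics: {
    fullSessionProgressRate: number; // computed over all turns
    recentProgressRate: number;      // computed over the last 30 turns
    testFailureCount: number;
  };
  rules: { id: string; threshold: number; description: string }[];
  ragContext: string[]; // retrieved guidance snippets
}

const input: JudgmentInput = {
  metrics: {
    fullSessionProgressRate: 0.7,
    recentProgressRate: 0.1,
    testFailureCount: 7,
  },
  rules: [
    {
      id: 'stall',
      threshold: 0.2,
      description: 'Recent progress rate below threshold',
    },
  ],
  ragContext: ['Prior stalls often followed repeated test failures.'],
};
```

Because the thresholds travel with the payload, the LLM's job reduces to interpretation, not arithmetic.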
The Progress Metric
One of the most important computed metrics is progress rate — the fraction of assistant turns that produced "meaningful activity." But what counts as meaningful?
```typescript
const investigationTools = ['Read', 'Grep', 'Glob', 'Bash', 'Edit', 'Write'];

// A turn has progress if it ran commands, edited files, or used investigation tools.
const hasProgress = (turn: Turn): boolean =>
  turn.commands.length > 0
  || turn.filesChanged.length > 0
  || turn.toolsUsed.some(t => investigationTools.includes(t));

// Text-only turns do NOT count as progress.
// This prevents the LLM from counting verbose explanations as "work".
```

The two-window approach (full session plus the most recent 30 turns) catches both chronic and acute problems. An agent that was productive for 120 turns and then stalls for 30 has a healthy full-session rate (0.70) but a terrible recent rate (0.10). Without the recent window, you'd miss the stall entirely.
Why This Makes the System Testable
The separation creates two independent test surfaces:
| Layer | Test Method | LLM Required? |
|---|---|---|
| Metric computation | Unit tests with fixed Turn data | No |
| LLM judgment | Eval suite with fixed metrics + gold-standard verdicts | Yes |
| End-to-end | Full session replay with expected alert outcomes | Yes |
The Principle
If you can write a deterministic function for it, don't send it to the LLM. Reserve the LLM for genuine judgment calls where context and nuance matter. Your system will be faster, cheaper, more testable, and more reliable.