Technique · 4 min · 2026-03-04

Never Cry Wolf: Why Failures Default to Pass

Every LLM failure in the pipeline produces a "pass" verdict. Here's why false negatives are cheaper than false positives in a supervision tool.

The Error Budget Question

When the LLM times out, returns garbage, or the server crashes mid-inference — what should the evaluator do? There are only two options: assume something is wrong (violation), or assume everything is fine (pass).

We choose pass. Every time. Here's why.

The Cost Asymmetry

| Failure Mode | Cost | Recovery |
| --- | --- | --- |
| False positive (cry wolf) | User gets an alert about nothing. Trust erodes. Tool gets disabled. | None — trust is hard to rebuild |
| False negative (missed alert) | Real problem goes undetected for one cycle (5 turns). | Next evaluation cycle catches it |

A false negative costs ~30 seconds of delay. A false positive costs your relationship with the user. In a monitoring tool, credibility is everything.

The Pattern Across the System

Fail-safe behaviors
Component On LLM Failure...
─────────────────────────────────────────────────
Evaluator → pass (confidence: 0.3)
RAG builder → heuristic fallback (threshold splits)
Plan detection → no change detected
Title generation → keep existing title
Explainer → skip silently
Context retrieval → return last 2 phases
Summary generation → skip this cycle
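Every row in the table is the same shape: try the LLM, and on any failure resolve to a component-specific safe default. That shape can be captured in one helper — a minimal sketch, where `withFailSafe` and the stubbed title call are illustrative names, not code from the repository:

```typescript
// Hypothetical generic wrapper: run an LLM-backed operation, but resolve
// to a component-specific safe default instead of propagating the failure.
async function withFailSafe<T>(op: () => Promise<T>, safeDefault: T): Promise<T> {
  try {
    return await op();
  } catch {
    // An infrastructure failure must never look like a detection.
    return safeDefault;
  }
}

// Example: title generation keeps the existing title on failure.
const flakyTitleCall = (): Promise<string> =>
  Promise.reject(new Error('inference timeout'));

withFailSafe(flakyTitleCall, 'Existing title').then(title => {
  console.log(title); // falls back to 'Existing title'
});
```

The point of centralizing it is consistency: no individual `catch` block gets to improvise a different failure mode.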

The Confidence Signal

Notice the fail-safe returns confidence: 0.3, not 0.9. This is deliberate:

packages/supervisor/src/evaluator.ts
try {
  const result = await this.llm.chatJSON<EvalResponse[]>(messages, {
    temperature: 0.6,
  });
  return this.parseEvalResults(result, rules);
} catch (error) {
  // Never produce a false positive from infrastructure failure
  return rules.map(rule => ({
    ruleId: rule.id,
    verdict: 'pass' as const,
    confidence: 0.3,  // Low confidence signals "I didn't really evaluate this"
    reason: 'Evaluation skipped due to inference error',
  }));
}

Downstream components can distinguish between a genuine "I evaluated this and it's fine" (confidence: 0.85) and a "the LLM was down so I defaulted" (confidence: 0.3). If you wanted to build a system health dashboard, low-confidence passes would flag inference reliability issues.
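As a sketch of that dashboard idea: a consumer could compute what fraction of passes were really defaults. This assumes the `EvalResult` shape from the snippet above; the function name and the 0.5 cutoff are illustrative choices, not part of the codebase:

```typescript
interface EvalResult {
  ruleId: string;
  verdict: 'pass' | 'violation';
  confidence: number;
  reason: string;
}

// Fraction of "pass" verdicts that were fail-safe defaults rather than
// genuine evaluations. A sustained high rate flags inference reliability
// problems, not healthy agent behavior.
function defaultedPassRate(results: EvalResult[]): number {
  const passes = results.filter(r => r.verdict === 'pass');
  if (passes.length === 0) return 0;
  const defaulted = passes.filter(r => r.confidence < 0.5);
  return defaulted.length / passes.length;
}
```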

When You'd Flip This

This error budget only works for monitoring tools. In other domains, the calculus flips:

| Domain | On Failure, Default To... | Why |
| --- | --- | --- |
| AI agent supervision | Pass (safe) | User trust > delayed detection |
| Medical diagnosis | Flag for review | Missed diagnosis > false alarm |
| Financial fraud | Block transaction | Fraud loss > false decline |
| Content moderation | Flag for review | Harmful content > over-moderation |
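One way to keep that decision from drifting is to encode it declaratively: choose the default once per domain, then route every `catch` through the policy. A hypothetical sketch — the names mirror the table above and are not from any real codebase:

```typescript
type FailureDefault = 'pass' | 'flag_for_review' | 'block';

// One place to record the error-budget decision for each domain.
const failurePolicy: Record<string, FailureDefault> = {
  'agent-supervision': 'pass',
  'medical-diagnosis': 'flag_for_review',
  'financial-fraud': 'block',
  'content-moderation': 'flag_for_review',
};

function onInferenceFailure(domain: string): FailureDefault {
  // Unknown domains get the conservative option, not a silent pass.
  return failurePolicy[domain] ?? 'flag_for_review';
}
```

The design choice here is that the fallback for an *unknown* domain is itself a failure-mode decision, and it should be the cautious one.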

The principle

Choose your failure mode before writing a single line of code. Ask: which is cheaper — a false positive or a false negative? Then make every catch block in the system enforce that choice consistently. In Singularity, the answer is unambiguous: never cry wolf.