How Singularity Works: From JSONL to Intervention
A complete walkthrough of the supervision pipeline — how we turn raw session logs into real-time corrective action.
Singularity monitors your AI coding agent in real-time, detects when it goes off the rails, and pastes corrective prompts directly into your terminal. No cloud. No API keys. Everything runs locally on Apple Silicon.
But how do you build a reliable supervision system on top of an inherently non-deterministic LLM? The answer is a 5-layer pipeline where code handles structure and the LLM handles judgment.
The Pipeline at a Glance
1. Monitor: poll JSONL files, parse turns, store in SQLite
2. RAG Index: build a Phase > Topic > Action tree every 4 turns
3. Evaluator: compute metrics in code, judge with the LLM every 5 turns
4. Alert: classify severity, trigger a notification and sound
5. Intervene: compose a prompt, find the terminal, paste via AppleScript
Each layer has clear ownership. Code owns "how often" and "how much." The LLM owns "what" and "why." If the LLM fails at any point, the system degrades gracefully — never producing a false alarm from an infrastructure hiccup.
Layer 1: The Monitor
Claude Code stores session logs as JSONL files in ~/.claude/projects/. Every second, Singularity polls for new lines using a persisted cursor — it never re-reads old data. Each line becomes a structured Turn object with tools used, files changed, commands run, and thinking logs.
Why JSONL polling?
Claude Code writes session data as append-only JSONL. We track a cursor per file — the byte offset of the last read. Each poll picks up only new lines. This is cheaper than filesystem watchers and handles large sessions without memory pressure.
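The cursor mechanic can be sketched in a few lines. This is a minimal illustration, not the app's actual code; the function name and Turn shape are assumptions. The key details are seeking to the persisted byte offset, advancing it only past complete lines, and tolerating a partially written trailing line.

```python
import json
from pathlib import Path

def poll_new_turns(path: Path, cursor: int) -> tuple[list[dict], int]:
    """Read only the lines appended since the last poll.

    `cursor` is the byte offset of the last complete line read.
    Returns the newly parsed turns plus the offset to persist.
    """
    turns = []
    with path.open("rb") as f:
        f.seek(cursor)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # incomplete trailing line still being written; re-read next poll
            cursor += len(raw)  # advance past every complete line, parsed or not
            try:
                turns.append(json.loads(raw))
            except json.JSONDecodeError:
                continue  # skip a malformed line rather than crash the monitor
    return turns, cursor
```

Because the cursor is persisted per file, a restart resumes exactly where the last poll left off, and a 10,000-line session costs the same per poll as a 10-line one.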
Layer 2: RAG Index Builder
Every 4 turns, the index builder asks the LLM: "where do these turns fit in the session structure?" The LLM returns one of three actions: extend the current topic, start a new topic, or start a new phase. The result is a hierarchical tree — no vector embeddings, no similarity search.
Left to itself, the LLM almost always answers "extend current topic," because consecutive turns are inherently related. So the code enforces hard structural limits: topics split after 20 turns, phases after 80. This keeps the tree balanced for retrieval.
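The override logic is simple enough to show directly. This is a sketch under assumptions: the action strings and function name are illustrative, but it captures the invariant that code, not the LLM, has the final word on tree shape.

```python
MAX_TOPIC_TURNS = 20  # hard limit before a topic must split
MAX_PHASE_TURNS = 80  # hard limit before a phase must split

def apply_structural_limits(llm_action: str, topic_turns: int, phase_turns: int) -> str:
    """Override the LLM's placement decision when a structural limit is hit.

    The LLM proposes one of "extend_topic", "new_topic", "new_phase";
    code enforces the tree-balance invariants regardless of what it said.
    """
    if phase_turns >= MAX_PHASE_TURNS:
        return "new_phase"  # phase is full: force a split
    if llm_action == "extend_topic" and topic_turns >= MAX_TOPIC_TURNS:
        return "new_topic"  # topic is full: promote "extend" to a split
    return llm_action  # within limits, trust the LLM's judgment
```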
Layer 3: The Evaluator
Every 5 turns, the evaluator runs. Code computes deterministic metrics — command failure counts, progress rates, file edit frequency. The LLM then judges those numbers against 7 rules written in markdown.
The core principle
Never ask an LLM to count things — it'll get the number wrong. Code computes "pnpm test failed 4 times." The LLM decides "is that a debug loop or intentional retry?" This separation makes the system testable: you can unit test counting without an LLM, and evaluate LLM judgment with fixed inputs.
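A minimal sketch of that division of labor, assuming a hypothetical Turn shape with a `commands` list (the real metric set and prompt wording will differ): code does all the counting, and the LLM receives only finished numbers embedded in a question.

```python
from collections import Counter

def compute_metrics(turns: list[dict]) -> dict:
    """Deterministic counting, done entirely in code."""
    failures = Counter()
    for turn in turns:
        for cmd in turn.get("commands", []):
            if cmd.get("exit_code", 0) != 0:
                failures[cmd["command"]] += 1
    return {"command_failures": dict(failures)}

def build_judgment_prompt(metrics: dict) -> str:
    """The LLM sees pre-computed facts, never raw data it might miscount."""
    lines = [f'- "{cmd}" failed {n} times'
             for cmd, n in metrics["command_failures"].items()]
    return ("Given these measured facts:\n" + "\n".join(lines) +
            "\nIs the agent stuck in a debug loop, or retrying intentionally?")
```

The counting half is a pure function over fixed inputs, so it unit-tests trivially; the judgment half can be evaluated against a frozen set of metric snapshots with known-correct verdicts.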
Layer 4: Alert & Notify
When the evaluator returns a violation, Singularity creates an alert, triggers a macOS notification with a severity-tiered sound (Tink for medium, Glass for high, Hero for critical), and batches rapid alerts within a 3-second window so you don't get notification-bombed.
Layer 5: Terminal Intervention
The intervention manager composes a corrective prompt tailored to the classification — debug loops get "try a different approach," plan drift gets "re-read the original task." It finds your running terminal (supports 8 emulators), activates it via AppleScript, pastes the prompt, and presses Return.
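The paste mechanism can be sketched with `pbcopy` plus `osascript`, both standard on macOS. This is one plausible implementation, not necessarily Singularity's: putting the prompt on the clipboard and simulating Cmd-V avoids per-character keystroke escaping, and key code 36 is Return in System Events.

```python
import subprocess

def build_paste_script(app_name: str) -> str:
    """Compose AppleScript that activates the terminal, pastes, and submits.

    Assumes the corrective prompt is already on the clipboard.
    """
    return (
        f'tell application "{app_name}" to activate\n'
        'delay 0.2\n'
        'tell application "System Events"\n'
        '    keystroke "v" using command down\n'
        '    key code 36\n'
        'end tell'
    )

def intervene(app_name: str, prompt: str) -> None:
    """Copy the prompt, then drive the terminal via osascript (macOS only)."""
    subprocess.run(["pbcopy"], input=prompt.encode(), check=True)
    subprocess.run(["osascript", "-e", build_paste_script(app_name)], check=True)
```

Supporting multiple emulators then reduces to detecting which terminal app owns the agent's process and passing its name to the script.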
The LLM Layer
All intelligence runs on Qwen3-8B (4-bit quantized) via mlx_lm.server on Apple Silicon. About 5GB of VRAM. All requests go through a serial inference queue — one at a time — to prevent Metal GPU crashes. Seven key engineering techniques make this reliable:
| Technique | What It Solves |
|---|---|
| Serial inference queue | Prevents GPU crashes from concurrent Metal commands |
| Prompt-based JSON + 3-tier retry | Reliable structured output without constrained decoding |
| Code counts, LLM judges | Deterministic metrics + qualitative judgment |
| Selective /think vs /no_think | Chain-of-thought only where task complexity justifies it |
| Hard overrides on LLM decisions | Code enforces structural invariants (RAG tree balance) |
| Fail-safe defaults | Infrastructure failures → pass, never false alarm |
| Fuzzy output matching | Tolerant parsing of approximate LLM compliance |
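The first row, the serial inference queue, is essentially a single lock in front of the model. A minimal sketch (class name illustrative, not the app's code): every caller funnels through one mutex, so the Metal backend never sees two in-flight requests.

```python
import threading

class SerialInferenceQueue:
    """Funnel all LLM requests through one lock so the GPU backend
    never executes two commands concurrently."""

    def __init__(self, infer):
        self._infer = infer            # the underlying model call
        self._lock = threading.Lock()  # callers queue here, FIFO-ish

    def request(self, prompt: str) -> str:
        with self._lock:  # exactly one inference at a time
            return self._infer(prompt)
```

Callers block rather than fail, which trades latency under contention for the guarantee that a burst of evaluator, indexer, and alert requests cannot crash the GPU.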
Each of these techniques has its own deep-dive post. The common thread: treat the LLM as a powerful but unreliable collaborator. Give it clear, bounded questions. Validate its answers. Have a fallback for when it fails. And never let it count.
The Result
On a 300-turn evaluation session: 56 alerts, 0 false positives, 100% recall. The system correctly identified every debug loop, plan drift, and stalled progress — while never crying wolf. Evaluation cycles take 20-40 seconds on an M-series Mac. Non-evaluation turns cost less than 1 millisecond.
Deterministic from non-deterministic
The whole point of this architecture is to produce deterministic supervision behavior from a non-deterministic model. Code defines when to evaluate, what to measure, and how to structure the session. The LLM adds richness — titles, summaries, nuanced judgment — but the structural guarantees hold even if you swap the model tomorrow.