How Singularity Works: From JSONL to Intervention
A complete walkthrough of the supervision pipeline — how we turn raw session logs into real-time corrective action.
Singularity monitors your AI coding agent in real-time, detects when it goes off the rails, and pastes corrective prompts directly into your terminal. No cloud. No API keys. Everything runs locally on Apple Silicon.
But how do you build a reliable supervision system on top of an inherently non-deterministic LLM? The answer is a 5-layer pipeline where code handles structure and the LLM handles judgment.
The Pipeline at a Glance
1. Monitor: poll JSONL files, parse turns, store in SQLite
2. RAG Index: build a Phase > Topic > Action tree every 4 turns
3. Evaluator: compute metrics in code, judge with the LLM every 5 turns
4. Alert: classify severity, trigger a notification and sound
5. Intervene: compose a prompt, find the terminal, paste via AppleScript
Each layer has clear ownership. Code owns "how often" and "how much." The LLM owns "what" and "why." If the LLM fails at any point, the system degrades gracefully — never producing a false alarm from an infrastructure hiccup.
Layer 1: The Monitor
Claude Code stores session logs as JSONL files in ~/.claude/projects/. Every second, Singularity polls for new lines using a persisted cursor — it never re-reads old data. Each line becomes a structured Turn object with tools used, files changed, commands run, and thinking logs.
Why JSONL polling?
Claude Code writes session data as append-only JSONL. We track a cursor per file — the byte offset of the last read. Each poll picks up only new lines. This is cheaper than filesystem watchers and handles large sessions without memory pressure.
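The cursor mechanic can be sketched in a few lines. This is a minimal illustration, not the app's actual code; the function name and Turn shape are assumptions. The key details are seeking to the persisted byte offset, advancing it only past complete lines, and tolerating a partially written trailing line.

```python
import json
from pathlib import Path

def poll_new_turns(path: Path, cursor: int) -> tuple[list[dict], int]:
    """Read only the lines appended since the last poll.

    `cursor` is the byte offset of the last complete line read.
    Returns the newly parsed turns plus the offset to persist.
    """
    turns = []
    with path.open("rb") as f:
        f.seek(cursor)
        for raw in f:
            if not raw.endswith(b"\n"):
                break  # incomplete trailing line still being written; re-read next poll
            cursor += len(raw)  # advance past every complete line, parsed or not
            try:
                turns.append(json.loads(raw))
            except json.JSONDecodeError:
                continue  # skip a malformed line rather than crash the monitor
    return turns, cursor
```

Because the cursor is persisted per file, a restart resumes exactly where the last poll left off, and a 10,000-line session costs the same per poll as a 10-line one.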
Layer 2: RAG Index Builder
Every 4 turns, the index builder asks the LLM: "where do these turns fit in the session structure?" The LLM returns one of three actions: extend the current topic, start a new topic, or start a new phase. The result is a hierarchical tree — no vector embeddings, no similarity search.
Left to itself, the LLM almost always answers "extend current topic," because consecutive turns are inherently related. So the code enforces hard structural limits: topics split after 20 turns, phases after 80. This keeps the tree balanced for retrieval.
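The override logic is simple enough to show directly. This is a sketch under assumptions: the action strings and function name are illustrative, but it captures the invariant that code, not the LLM, has the final word on tree shape.

```python
MAX_TOPIC_TURNS = 20  # hard limit before a topic must split
MAX_PHASE_TURNS = 80  # hard limit before a phase must split

def apply_structural_limits(llm_action: str, topic_turns: int, phase_turns: int) -> str:
    """Override the LLM's placement decision when a structural limit is hit.

    The LLM proposes one of "extend_topic", "new_topic", "new_phase";
    code enforces the tree-balance invariants regardless of what it said.
    """
    if phase_turns >= MAX_PHASE_TURNS:
        return "new_phase"  # phase is full: force a split
    if llm_action == "extend_topic" and topic_turns >= MAX_TOPIC_TURNS:
        return "new_topic"  # topic is full: promote "extend" to a split
    return llm_action  # within limits, trust the LLM's judgment
```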
Layer 3: The Evaluator
Every 5 turns, the evaluator runs. Code computes deterministic metrics — command failure counts, progress rates, file edit frequency. The LLM then judges those numbers against 7 rules written in markdown.
The core principle
Never ask an LLM to count things — it'll get the number wrong. Code computes "pnpm test failed 4 times." The LLM decides "is that a debug loop or intentional retry?" This separation makes the system testable: you can unit test counting without an LLM, and evaluate LLM judgment with fixed inputs.
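A minimal sketch of that division of labor, assuming a hypothetical Turn shape with a `commands` list (the real metric set and prompt wording will differ): code does all the counting, and the LLM receives only finished numbers embedded in a question.

```python
from collections import Counter

def compute_metrics(turns: list[dict]) -> dict:
    """Deterministic counting, done entirely in code."""
    failures = Counter()
    for turn in turns:
        for cmd in turn.get("commands", []):
            if cmd.get("exit_code", 0) != 0:
                failures[cmd["command"]] += 1
    return {"command_failures": dict(failures)}

def build_judgment_prompt(metrics: dict) -> str:
    """The LLM sees pre-computed facts, never raw data it might miscount."""
    lines = [f'- "{cmd}" failed {n} times'
             for cmd, n in metrics["command_failures"].items()]
    return ("Given these measured facts:\n" + "\n".join(lines) +
            "\nIs the agent stuck in a debug loop, or retrying intentionally?")
```

The counting half is a pure function over fixed inputs, so it unit-tests trivially; the judgment half can be evaluated against a frozen set of metric snapshots with known-correct verdicts.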
Layer 4: Alert & Notify
When the evaluator returns a violation, Singularity creates an alert, triggers a macOS notification with a severity-tiered sound (Tink for medium, Glass for high, Hero for critical), and batches rapid alerts within a 3-second window so you don't get notification-bombed.
Layer 5: Terminal Intervention
The intervention manager composes a corrective prompt tailored to the classification — debug loops get "try a different approach," plan drift gets "re-read the original task." It finds your running terminal (supports 8 emulators), activates it via AppleScript, pastes the prompt, and presses Return.
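The paste mechanism can be sketched with `pbcopy` plus `osascript`, both standard on macOS. This is one plausible implementation, not necessarily Singularity's: putting the prompt on the clipboard and simulating Cmd-V avoids per-character keystroke escaping, and key code 36 is Return in System Events.

```python
import subprocess

def build_paste_script(app_name: str) -> str:
    """Compose AppleScript that activates the terminal, pastes, and submits.

    Assumes the corrective prompt is already on the clipboard.
    """
    return (
        f'tell application "{app_name}" to activate\n'
        'delay 0.2\n'
        'tell application "System Events"\n'
        '    keystroke "v" using command down\n'
        '    key code 36\n'
        'end tell'
    )

def intervene(app_name: str, prompt: str) -> None:
    """Copy the prompt, then drive the terminal via osascript (macOS only)."""
    subprocess.run(["pbcopy"], input=prompt.encode(), check=True)
    subprocess.run(["osascript", "-e", build_paste_script(app_name)], check=True)
```

Supporting multiple emulators then reduces to detecting which terminal app owns the agent's process and passing its name to the script.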
The LLM Layer
All intelligence runs on Qwen3-8B (4-bit quantized) via mlx_lm.server on Apple Silicon. About 5GB of VRAM. All requests go through a serial inference queue — one at a time — to prevent Metal GPU crashes. Seven key engineering techniques make this reliable:
| Technique | What It Solves |
|---|---|
| Serial inference queue | Prevents GPU crashes from concurrent Metal commands |
| Prompt-based JSON + 3-tier retry | Reliable structured output without constrained decoding |
| Code counts, LLM judges | Deterministic metrics + qualitative judgment |
| Selective /think vs /no_think | Chain-of-thought only where task complexity justifies it |
| Hard overrides on LLM decisions | Code enforces structural invariants (RAG tree balance) |
| Fail-safe defaults | Infrastructure failures → pass, never false alarm |
| Fuzzy output matching | Tolerant parsing of approximate LLM compliance |
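The first row, the serial inference queue, is essentially a single lock in front of the model. A minimal sketch (class name illustrative, not the app's code): every caller funnels through one mutex, so the Metal backend never sees two in-flight requests.

```python
import threading

class SerialInferenceQueue:
    """Funnel all LLM requests through one lock so the GPU backend
    never executes two commands concurrently."""

    def __init__(self, infer):
        self._infer = infer            # the underlying model call
        self._lock = threading.Lock()  # callers queue here, FIFO-ish

    def request(self, prompt: str) -> str:
        with self._lock:  # exactly one inference at a time
            return self._infer(prompt)
```

Callers block rather than fail, which trades latency under contention for the guarantee that a burst of evaluator, indexer, and alert requests cannot crash the GPU.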
Each of these techniques has its own deep-dive post. The common thread: treat the LLM as a powerful but unreliable collaborator. Give it clear, bounded questions. Validate its answers. Have a fallback for when it fails. And never let it count.
The Result
On a 300-turn evaluation session: 56 alerts, 0 false positives, 100% recall. The system correctly identified every debug loop, plan drift, and stalled progress — while never crying wolf. Evaluation cycles take 20-40 seconds on an M-series Mac. Non-evaluation turns cost less than 1 millisecond.
Deterministic from non-deterministic
The whole point of this architecture is to produce deterministic supervision behavior from a non-deterministic model. Code defines when to evaluate, what to measure, and how to structure the session. The LLM adds richness — titles, summaries, nuanced judgment — but the structural guarantees hold even if you swap the model tomorrow.