Skip to main content

Command Palette

Search for a command to run...

I Built a Self-Measuring Hook System for Claude Code — Here's What 427+ Automated Runs Taught Me

Published
10 min read
I Built a Self-Measuring Hook System for Claude Code — Here's What 427+ Automated Runs Taught Me
S

FullStack / ETL Developer. Algo-Trader at NSE. Blockchain Enthusiast.

Most people using AI coding agents treat them like a magic autocomplete. Type a prompt, get code, ship it.

I did that too. Then things started breaking in weird ways — production code without tests, force-pushes to main at 2am, context getting wiped mid-session and Claude forgetting critical rules.

So I stopped and asked: what if the AI agent had guardrails that enforced discipline automatically — and measured themselves?

Over the past few weeks, I've built a hook system for Claude Code inside our fintech monorepo. Not a few scripts stitched together — a full lifecycle of pre-checks, post-validations, audit trails, context recovery, and a metrics layer that snapshots progress daily.

Here's what it looks like, what each hook does, and what the numbers are telling me.


What Are Claude Code Hooks?

Think of hooks like airport security checkpoints for your AI agent.

Before Claude runs a bash command → a pre-tool hook inspects it. After Claude edits a file → a post-tool hook validates it. When Claude's context gets compacted (memory reset) → a compact hook saves and restores critical rules.

Every hook runs automatically. Zero manual intervention. Claude doesn't even know it's being watched — it just gets blocked or warned when it tries something it shouldn't.


The Hook Categories (and Why Each One Exists)

🛡️ 1. Safety Hooks — The Seatbelts

Problem: AI agents can run any bash command. Including destructive ones.

Hook: pre-tool-bash — intercepts every bash command before execution and blocks patterns like:

→ rm -rf /           ← blocked
→ DROP DATABASE       ← blocked
→ git push --force main  ← blocked
→ redis-cli FLUSHALL     ← blocked

Real-world analogy: It's like a car that physically won't start if the seatbelt isn't on. You don't rely on the driver remembering — the system enforces it.

Impact: Zero destructive commands have gotten through. Ever.


🧪 2. TDD Enforcement — The Bouncer

Problem: It's tempting to write production code first and "add tests later." With AI agents, this temptation is 10x worse — Claude can generate code so fast that tests become an afterthought.

Hook: pre-tool-tdd — blocks any write to a .ts production file if the corresponding .spec.ts test file doesn't exist yet.

Claude tries to write: apps/smf-core/src/services/ledger.service.ts

Hook checks: does ledger.service.spec.ts exist?
  → No? BLOCKED. Exit code 2. Write rejected.
  → Yes? Proceed.

Smart enough to exempt files that don't need tests — type definitions, enums, constants, DTOs, barrel exports.

Real-world analogy: A bouncer at a club checking your ID. No test file = no entry. Doesn't matter how good the code looks.

Current metric: 100% TDD compliance. Every production file has its test. Not because I'm disciplined — because the hook won't let me skip it.


🔍 3. Auto-Typecheck — The Spell Checker

Problem: TypeScript errors compound fast in a monorepo. One bad type change ripples across 6 packages. By the time you notice, you've built 3 more features on a broken foundation.

Hook: post-tool-typecheck — runs a scoped typecheck on the affected workspace immediately after every .ts file edit.

Claude edits: apps/auth-service/src/auth.controller.ts

Hook triggers: pnpm turbo run typecheck --filter=auth-service
  → Pass? Silent. Continue working.
  → Fail? Warning with exact error. Fix before moving on.

Real-world analogy: Like spell-check highlighting errors as you type — not waiting until you hit "submit."

Why it matters: Catches type regressions within seconds of introduction, not hours later during a build.


🎯 4. Agent Model Governance — The Right Specialist for the Right Job

Problem: Our monorepo uses multiple Claude agents — Haiku for lightweight services (card-processor, auth), Sonnet for complex business logic (smf-core, integration tests), Opus for final PM review. Spawning the wrong model wastes tokens and time.

Hook: pre-tool-agent — validates that each spawned agent uses the expected model tier.

Spawning: card-processor agent
Expected model: haiku
Actual model: sonnet
  → WARNING: Model mismatch detected. Consider using haiku.

Real-world analogy: Like a hospital assigning the right specialist to the right procedure. You don't send the chief surgeon to check a blood pressure reading.

Current metric: 57.1% model match rate. Yes, that's low — and the hook is exactly how I know it's low. Without measurement, I'd have assumed it was fine. Now I'm actively improving it.


🧠 5. Context Recovery — Saving Your Game Before the Boss Fight

Problem: Claude Code compacts context when conversations get long. When it compacts, it can forget critical rules — like "always write tests first" or "never modify shared-types directly."

Hooks: pre-compact + post-compact-inject working as a pair.

BEFORE compaction:
  → pre-compact takes a snapshot:
    - Current status files
    - Git state
    - 5 critical rules

AFTER compaction:
  → post-compact-inject restores:
    - All 5 critical rules as additionalContext
    - The pre-compact snapshot for state recovery

The 5 critical rules that survive every compaction:

  1. TDD: failing test first, then implementation

  2. Typecheck after every edit

  3. Never modify shared-types directly

  4. Modern Treasury metadata must include smf_type

  5. Phase 3 review is mandatory before commit

Real-world analogy: Like saving your game before a boss fight. If you die (context reset), you reload from the save point — not from the beginning.

Current metric: 6 compaction events survived with zero rule amnesia.


📝 6. Audit Trail — The Black Box Recorder

Problem: When something goes wrong in a long AI session, you need to trace what happened. "What commands did Claude run? What files did it touch? Which skills did it use?"

Hooks: Three loggers running in parallel:

command-logger  → logs every bash command (rolling 200 entries)
skill-logger    → logs every skill invocation (rolling 500 entries)
subagent-logger → logs every agent spawn (model, type, description)
write-audit     → logs every file write with timestamp

Plus a shared error trap (_err-trap) that captures hook failures with timestamp, exit code, and exact line number.

Real-world analogy: A flight data recorder (black box). You hope you never need it. But when you do, it tells you exactly what happened.

Current metric: 227 file operations logged. 200 commands tracked. 10 skill/agent invocations recorded. Every single one traceable.


🗼 7. Orchestration Hooks — Air Traffic Control

Problem: In a multi-agent monorepo, agents can block each other. One agent hits an unresolvable issue → the whole pipeline needs to halt. Another agent finishes → the next phase should begin.

Hook: subagent-stop — runs when any sub-agent completes:

Agent finishes → hook checks status files:
  → PROJECT_BLOCKED.md found? HALT EVERYTHING. Surface to human.
  → BLOCKED.md found? Log it. Attempt retry.
  → FIXED.md found? All clear. Advance to next phase.

Real-world analogy: Air traffic control. Every plane (agent) reports its status on landing. ATC (orchestrator) decides what takes off next.


The Metrics Layer — Hooks That Measure Themselves

Here's where it gets interesting.

The hooks don't just enforce rules — they generate structured telemetry. Every blocked command, every TDD violation, every model mismatch, every compaction event gets logged as a JSONL entry with timestamps.

Then a /metrics skill reads all that telemetry, computes 6 core metrics, compares against the previous day's snapshot, and shows trends:

┌──────────────────────┬─────────
│ Metric               │ Today   │ Trend           │
├──────────────────────┼─────────
│ Hook Reliability     │ 98.6%   │   →            │
│ TDD Compliance       │ 100%    │   →            │
│ Agent Model Match    │ 57.1%   │   ↑             │
│ Typecheck Pass Rate  │ pending │   —             │
│ Review First-Pass    │ pending │   —             │
│ Compaction Events    │ 6/week  │   →            │
└──────────────────────┴─────────

Daily snapshots get saved to .claude/metrics/snapshots/ — one JSON file per day. Over time, this builds a progress timeline showing exactly where discipline is improving and where it's slipping.

What the Numbers Are Telling Me

  • 98.6% hook reliability across 427+ automated runs — 4 errors total, all from the agent model hook during early calibration. The system is stable.

  • 100% TDD compliance — not a single production file exists without its test. The bouncer works.

  • 57.1% agent model match — this is the weakest spot, and I only know because the hook is measuring it. Active improvement area.

  • 6 compaction survivals — context got wiped 6 times, and all 5 critical rules were restored every time. Zero rule amnesia.

Supporting Skills in the Ecosystem

The metrics don't just sit in JSON files. They feed into a broader analysis ecosystem:

  • /health-check — runs at session start, reports hook errors from the last 7 days

  • /reflect — analyzes command logs and skill usage to suggest new automation rules

  • /retro — weekly retrospective with week-over-week trend comparison

  • analyze-patterns.sh — detects repeated commands (3+ times) and suggests turning them into skills

It's a feedback loop: hooks enforce → metrics measure → skills analyze → rules improve → hooks enforce better.


The Flywheel Effect

Here's the mental model I keep coming back to:

     ┌──────────────┐
     │   Hooks       │ ← enforce rules automatically
     │   (guardrails)│
     └──────┬───────┘
            │ generate
            ▼
     ┌──────────────┐
     │   Metrics     │ ← structured telemetry
     │   (JSONL)     │
     └──────┬───────┘
            │ feed into
            ▼
     ┌──────────────┐
     │   Skills      │ ← /metrics, /reflect, /retro
     │   (analysis)  │
     └──────┬───────┘
            │ suggest
            ▼
     ┌──────────────┐
     │   Rules       │ ← CLAUDE.md, new hooks
     │   (improved)  │
     └──────┬───────┘
            │ enforced by
            ▼
     ┌──────────────┐
     │   Hooks       │ ← cycle repeats, tighter each time
     └──────────────┘

Every cycle, the system gets tighter. The agent gets more disciplined. The code gets more reliable. And I have the numbers to prove it's actually happening.


What I'd Tell Someone Starting This Today

  • Start with safety hooks. Block destructive commands before anything else. It takes 10 minutes and saves you from catastrophic mistakes.

  • Add TDD enforcement early. Once you have 500 files without tests, it's too late. The bouncer is easiest to install when the club is empty.

  • Measure everything, even the embarrassing numbers. My 57% model match rate isn't pretty. But I can only improve what I can see.

  • Build the compact recovery hooks. Context loss is the silent killer of long AI sessions. Your agent will forget your most important rules at the worst possible time.

  • Close the feedback loop. Hooks without metrics are just guardrails. Hooks WITH metrics are a self-improving system.


AI coding agents are powerful. But power without guardrails is just chaos with better autocomplete.

Build the guardrails. Measure the guardrails. Let the guardrails measure themselves.

That's how you go from "AI writes code for me" to "AI writes code the way I want it to."

🔧 Building this at OrbitxPay — where we're wiring traditional banking to onchain infrastructure. If you're building with AI agents in production, I'd love to hear what guardrails you've put in place.

#ClaudeCode #AIAgents #DevTooling #BuildingInPublic #EngineeringLeadership #Fintech