Calibration data

Six real-world CLAUDE.md files, scored by hand on the 6-axis rubric and then re-scored by the auditor skill. The auditor matched hand scores on all six (Δ=0). Per-axis breakdown + justifications below.

Source files were fetched 2026-05-21 from the main branch of each repo. The synthetic floor case (#6) was hand-written to expose every red flag in the rubric — it's the "what bad looks like" anchor.

Case	Type	Overall	Specificity	Coverage	Brevity	Currency	Quirks	Tone
`humanlayer/humanlayer`	Real monorepo	70	7	6	8	9	7	6
`midudev/miduconf-website`	Real product app	74	9	7	8	9	6	6
`shanraisshan/claude-code-best-practice`	Meta-tooling reference	78	8	8	6	9	8	7
`zircote/.claude`	Personal global	67	8	8	3	4	8	5
`VoltAgent/awesome-claude-code-subagents`	Awesome-list	59	6	6	9	7	4	6
Synthetic floor	Generic / stale model	14	1	1	6	2	0	1

Per-axis numbers above are the auditor's emit; hand scores matched at Δ=0 across all 36 axis values (6 cases × 6 axes). 0 hallucinated quotes across 36 axis justifications.

Per-case justifications

One paragraph each. The full per-axis prose (what the auditor wrote when scoring) lives in the Pack's CALIBRATION-AUDITS.md — these are the highlights.

Case 1 — `humanlayer/humanlayer` 70 / 100

Why this score: Solid monorepo map with real commands (make check-test, gh workflow run "Build macOS Release Artifacts") and a useful TODO-priority convention. Specificity 7 — strong repo-specific paths, but the "Technical Guidelines" section drifts generic ("Modern ES6+ features", "Standard Go idioms"). Quirks captured 7 — real monorepo gotchas like "Package managers vary — check package.json", "Some use Jest, others Vitest". Coverage 6 — missing do/don't and ownership pointers.

Top fix: Add a Known Gotchas section. Quirks is the highest-weighted axis with room to grow here.

Case 2 — `midudev/miduconf-website` 74 / 100

Why this score: High-specificity walkthrough of a real product (Next.js 15 + React 19 + bun, Supabase tables named explicitly). Specificity 9. But reads like a wiki — Coverage 7, Quirks 6, Tone 6. Captures architectural choices ("Legacy editor (kept for compatibility)", "Smart catch-all", "Pages Router instead of App Router") but no operational gotchas — nothing about what breaks at deploy time.

Top fix: Add a Quirks section listing the operational traps an agent would otherwise rediscover painfully.

Case 3 — `shanraisshan/claude-code-best-practice` 78 / 100

Why this score: A well-organised reference for Claude Code patterns, slightly over-budget on length but rich in real gotchas. Quirks 8 — multiple high-quality entries like "Subagents cannot invoke other subagents via bash", "Avoid vague terms like 'launch'", "Keep CLAUDE.md under 200 lines for reliable adherence". Brevity dropped to 6 because the enumerated frontmatter field lists are reference material that belongs in deeper docs.

Top fix: Move the field-enumeration tables to a linked reference doc. Every session pays the token cost; almost no session uses all of it.

Case 4 — `zircote/.claude` (personal global) 67 / 100

Why this score: Sophisticated and full of real lessons but bloated and version-locked to a retired model. Brevity 3 — ~3,000 tokens, ~4× the personal-global 800-token soft ceiling. Currency 4 — two full sections hardcoded to "Opus 4.5" while the current Claude family is 4.7; hardcoded retired model triggers the rubric's Currency cap at 5. Quirks 8 saves the overall.

Top fix: Cut the "Completed Spec Projects" history and reference the current Claude generation generically rather than pinning to a retired marketing name.

Case 5 — `VoltAgent/awesome-claude-code-subagents` 59 / 100

Why this score: Tight and on-topic for a thin meta-repo — Brevity 9 — but missing the quality-bar guidance that would actually help maintainers triage contributions. Coverage 6 under the content/awesome-list rubric substitution (the rubric gives content repos a fairer baseline — no penalty for "no build commands"). Quirks 4 — only one captured.

Top fix: Add a quality bar / review-rules section so the agent (and contributors) know what gets accepted.

Case 6 — Synthetic floor 14 / 100

Why this score: Hand-written to expose every red flag in the rubric. Specificity 1 — zero repo specifics; triggers the hard cap at 4 because the file contains "Claude is an AI assistant created by Anthropic". Currency 2 — hardcodes "Claude 3.5 Sonnet" as the model to use. Quirks 0. Tone 1 — all the LLM-flavored tells ("Welcome!", "comprehensive guide", "let's dive in!", "Whether you're a beginner or expert", 🚀). Canonical "what bad looks like."

Top fix: Full rewrite, not edits. This file tells Claude things Claude already knows; it's pure context bloat.

Methodology notes

Model: all scores produced by claude-opus-4-7, the same model the production auditor prompt targets.
Comparison: hand scores were produced first by an engineer reading each file against the rubric. Auditor scores were then run separately. Δ=0 on overall scores and all 36 axis values.
Anchoring caveat: the auditor's calibration pass was run with the hand-scored baselines visible in the same context. A blind re-run (separate session, no baseline context) is queued as a v1.1 regression check. The rubric is identical between web-form and skill-form, so anchoring effect should be small — but it should be measured before treating these scores as the canonical reference. When the blind re-run lands the deltas (if any) will replace this note.
Hallucinated quotes: zero across 36 axis justifications. The auditor only quotes phrases that appear verbatim in the file under audit.
Hard caps in play: Case 6 triggers Specificity cap at 4 (anthropic-intro boilerplate) and Currency cap at 5 (retired-model hardcode). Case 4 triggers Currency cap at 5 (Opus 4.5 hardcode).

Want the auditor skill?

The auditor that produced these scores is one of five skills in the upcoming Claude Code Power Pack. Generator + Auditor as a tight loop, Hook Pack (8 tested hooks across Linux + macOS), PR Reviewer, CI Babysitter. Ships when all 5 hit v1.

Power Pack is still in build. Ship notice will land on the home page when it's ready — no email signup, no waitlist.

Calibration data

Per-case justifications

Case 1 — humanlayer/humanlayer 70 / 100

Case 2 — midudev/miduconf-website 74 / 100

Case 3 — shanraisshan/claude-code-best-practice 78 / 100

Case 4 — zircote/.claude (personal global) 67 / 100

Case 5 — VoltAgent/awesome-claude-code-subagents 59 / 100