The drift-rate scoreboard

Do AI coding agents' commit messages match what their commits actually did? Measured from evidence the message-writer couldn't fake: the commit's own diff.

1.2% of 1,715 concrete claims in agent-authored commits were unwitnessed by the commit's own diff — across 12 public repositories, June 2026

A commit subject is a claim ("fix the race", "add tests"). The list of files the commit touched is evidence the claimant did not author. Drift is a concrete claim whose own diff contradicts its kind — an empty commit claiming a fix, a "tests pass" subject touching no test file. It is a claim-vs-diff mismatch, never a correctness or malice grade.
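As a rough illustration of the claim-versus-diff check described above, the sketch below classifies a subject into a claim kind and tests it against the touched-file list. It is not the witness shipped in dos-kernel: the `Commit` type, the two regular expressions, and the `is_test_file` heuristic are all hypothetical stand-ins for a grammar this report does not spell out.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical claim grammar; the real witness grammar is not shown in this report.
TEST_CLAIM = re.compile(r"\btests?\b")
FIX_CLAIM = re.compile(r"\bfix(es|ed)?\b")


@dataclass(frozen=True)
class Commit:
    subject: str
    touched_files: Tuple[str, ...]


def is_test_file(path: str) -> bool:
    # Crude illustrative heuristic: any path mentioning "test".
    return "test" in path.lower()


def witness(commit: Commit) -> Optional[bool]:
    """True = witnessed, False = drift, None = abstain (no checkable claim)."""
    subject = commit.subject.lower()
    if TEST_CLAIM.search(subject):
        kind = "tests"
    elif FIX_CLAIM.search(subject):
        kind = "fix"
    else:
        return None  # nothing concrete to check, so abstain
    if not commit.touched_files:
        return False  # an empty commit cannot witness a concrete claim
    if kind == "tests":
        return any(is_test_file(p) for p in commit.touched_files)
    return True  # a non-empty diff does not contradict a "fix" claim


print(witness(Commit("fix the race", ())))
print(witness(Commit("add tests for parser", ("src/parser.py",))))
```

Both example commits come back as drift: the first is an empty commit claiming a fix, the second claims tests while touching no test file. The result says nothing about whether the code is correct, only whether the diff is consistent with the kind of claim made.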

Report #1 — June 2026

repositories in the corpus: 12 (≥500 stars, active, ≥20 machine-attributed agent commits; v1 names none)
default-branch commits scanned: 90,500
machine-attributed agent commits: 12,151 (13 toolchains: Claude Code, aider, Codex, Copilot, Devin, Cursor, OpenHands, opencode, Roo, Qwen-Coder, OpenClaw, Jules, Crush)
audited (cap 300 newest per repo): 3,000
made a concrete, checkable claim: 1,715 (the denominator; 1,285 abstained)
witnessed by the commit's own diff: 1,694
unwitnessed: 21
pooled drift rate: 1.2% (median per-repo 0.75%, spread 0.0–5.1%, four repos at zero)
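The funnel above can be re-derived by hand. The short check below uses only the published counts; the variable names are illustrative.

```python
# Published counts from Report #1.
audited = 3_000
abstained = 1_285
claims = 1_715        # the denominator
witnessed = 1_694
unwitnessed = 21

# The funnel should add up at each step.
assert claims + abstained == audited
assert witnessed + unwitnessed == claims

# Pooled rate: all unwitnessed claims over all concrete claims.
pooled_rate = 100 * unwitnessed / claims
print(f"pooled drift rate: {pooled_rate:.1f}%")  # prints "pooled drift rate: 1.2%"
```

The pooled rate weights each repository by how many concrete claims it contributed, which is why it can sit above the per-repo median of 0.75%. The per-repo counts are not published, so the median cannot be recomputed from this table.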

Every fire, hand-adjudicated

The number is inspectable, never just a number. The 21 fires, de-identified, by class:

The honest reading is not "agents lie 1.2% of the time." In established, attribution-honest, human-reviewed repositories, agent-authored commit subjects are overwhelmingly witnessed by their own diffs — and a deterministic witness still finds a small real residue resting on message text alone, including three textbook empty over-claims. Repos that strip agent attribution are invisible to this method; unreviewed direct-push fleets are not in this corpus.

Methodology first

The full methodology was published before any number, with the false-positive story leading. It covers what the witness reads and abstains on; the closed agent-attribution marker set (under-matching by construction; review-suggestion bots excluded); the mechanical corpus criteria; and the aggregate-only ethics floor. No repository is named without opt-in, and this is enforced structurally: the aggregation step never receives repo names, URLs, or commit SHAs, and a test pins it.
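One way to make "enforced structurally" concrete is to give the aggregation step an input type with no room for identity at all, then pin that shape with a test. The sketch below assumes this design; `RepoTally`, `Aggregate`, and `aggregate` are hypothetical names, not the interface of scripts/drift_scoreboard.py, and the numbers in the usage line are toy values.

```python
from dataclasses import dataclass, fields
from typing import Iterable


@dataclass(frozen=True)
class RepoTally:
    """Per-repo counts only (hypothetical type): no name, URL, or SHA field exists."""
    claims: int
    unwitnessed: int


@dataclass(frozen=True)
class Aggregate:
    repos: int
    claims: int
    unwitnessed: int


def aggregate(tallies: Iterable[RepoTally]) -> Aggregate:
    # The aggregator only ever sees integer counts, so it cannot leak identity.
    tallies = list(tallies)
    return Aggregate(
        repos=len(tallies),
        claims=sum(t.claims for t in tallies),
        unwitnessed=sum(t.unwitnessed for t in tallies),
    )


def test_tally_carries_no_identity() -> None:
    # Pins the input shape: adding a string field such as repo_name fails here.
    assert {f.name for f in fields(RepoTally)} == {"claims", "unwitnessed"}
    assert all(f.type is int for f in fields(RepoTally))


test_tally_carries_no_identity()
print(aggregate([RepoTally(10, 1), RepoTally(5, 0)]))  # toy values, not corpus data
```

The point of this design is that the guarantee does not depend on anyone remembering to redact. Identity is stripped before the aggregation boundary, and the test fails if the boundary type ever grows a field that could carry it.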

We graded ourselves first: our own history scored 2.5%, and all three fires were a documented empty-stamp convention. Running the witness at corpus scale then forced three witness-grammar fixes (#79, #81, #94), each landed before publishing. The raw rate before them read 5.1%; the adjudicated truth didn't change, the grader did.

Score your own repo

pip install dos-kernel
dos commit-audit --sweep --workspace . BASE..HEAD

Same witness, your history, one command. The corpus tool itself is scripts/drift_scoreboard.py (corpus list in, identity-stripped aggregate out).