Rendered from docs/scoreboard/README.md. The Markdown in the repo is the source of truth; this page is generated by scripts/build_incident_pages.py and is never hand-edited.
How AI built the software you already use
Agents now write a real share of the popular open-source projects you depend on — and they write their own commit messages too. This board looks at the recent history of well-known repos and asks three plain questions: how much of it did AI write, which agent did it, and what kind of work was it — fixes, tests, docs.
The catch is that a commit message is just text the agent typed; the diff is what git actually recorded, and the two can disagree. So every number here is checked against the diff, never the message alone. That is the difference between this board and a star count: it reads the thing that can't be talked up.
The picture
Three views of the same audited history. Every figure is generated from the committed per-repo data — no live calls, reproducible offline by anyone who clones the repo.
Across these 19 repos, claude is the most prolific agent: it wrote 63% of all the AI-authored commits here, with 7 other toolchains sharing the rest. Of the work all agents claimed, 75% was shipping code rather than tests or docs.
Score your own repo in one command
pip install dos-kernel
dos commit-audit --sweep --workspace . BASE..HEAD
That is the exact same check the board runs, on your history — before you trust the next "done". No account, no upload, no one named.
Start here — the auditor grades itself
We ran the check on our own repo first and published whatever it said. It says non-zero — a few commits that claim a fix but touched nothing. They're a deliberate house convention, and the page shows exactly why. We left them in. A scoreboard that airbrushed its own page to zero wouldn't be worth reading.
- anthony-chaudhary/dos-kernel — our own grade, every flag explained.
Repo by repo
The detail behind the charts — each repo's AI-built share, the agents that did it, and whether every checkable claim was backed by its own diff. Sorted by AI-built share. Click a repo for the full receipt.
| Repo | AI-built | Agents | Claims checked | Backed |
|---|---|---|---|---|
| kenn-io/roborev | 65% | claude 430 · copilot 1 · cursor 1 | 273 | 100% |
| JuliusBrussee/caveman | 32% | claude 65 | 49 | 100% |
| getzep/graphiti | 15% | claude 127 | 66 | 100% |
| pydantic/pydantic-ai | 9% | claude 188 · devin 7 · copilot 4 · … | 139 | 100% |
| openai/codex | 5% | codex 331 · claude 10 · copilot 3 | 155 | 100% |
| exo-explore/exo | 4% | claude 99 · cursor 1 · jules 1 | 67 | 100% |
| OpenInterpreter/open-interpreter | 4% | codex 240 · claude 10 · copilot 3 | 118 | 100% |
| assistant-ui/assistant-ui | 4% | claude 119 · copilot 12 · devin 2 · … | 79 | 100% |
| crewAIInc/crewAI | 3% | devin 51 · claude 29 · aider 3 · … | 69 | 100% |
| mem0ai/mem0 | 3% | claude 77 | 66 | 100% |
| agno-agi/agno | 3% | claude 159 · copilot 7 · aider 1 · … | 103 | 100% |
| charmbracelet/crush | 3% | crush 86 · copilot 9 · claude 1 | 50 | 100% |
| farion1231/cc-switch | 2% | claude 40 · copilot 1 · cursor 1 | 30 | 100% |
| livekit/agents | 2% | claude 45 · devin 17 · cursor 6 · … | 58 | 100% |
| danny-avila/LibreChat | 1% | claude 24 · copilot 13 · cursor 1 | 24 | 100% |
| microsoft/autogen | 1% | copilot 28 · claude 2 | 27 | 100% |
| unslothai/unsloth | <1% | claude 26 · cursor 2 | 22 | 100% |
| langchain-ai/langchain | <1% | copilot 24 · claude 15 | 29 | 100% |
| anthony-chaudhary/dos-kernel | — | — | 315 | 98% |
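The headline agent share can be approximated from the Agents column above. The sketch below is illustrative only: it assumes each number in that column is a count of AI-authored commits, it skips the entries the table elides with "…", and the name `tally_agents` is hypothetical. It is not the roll-up script the board uses.

```python
from collections import Counter

# Agents column copied from the table above. The dos-kernel row lists no
# agents, so it is left out. "…" marks counts the table itself elides.
AGENT_CELLS = [
    "claude 430 · copilot 1 · cursor 1",
    "claude 65",
    "claude 127",
    "claude 188 · devin 7 · copilot 4 · …",
    "codex 331 · claude 10 · copilot 3",
    "claude 99 · cursor 1 · jules 1",
    "codex 240 · claude 10 · copilot 3",
    "claude 119 · copilot 12 · devin 2 · …",
    "devin 51 · claude 29 · aider 3 · …",
    "claude 77",
    "claude 159 · copilot 7 · aider 1 · …",
    "crush 86 · copilot 9 · claude 1",
    "claude 40 · copilot 1 · cursor 1",
    "claude 45 · devin 17 · cursor 6 · …",
    "claude 24 · copilot 13 · cursor 1",
    "copilot 28 · claude 2",
    "claude 26 · cursor 2",
    "copilot 24 · claude 15",
]


def tally_agents(cells):
    """Sum the visible per-agent commit counts across all table rows."""
    totals = Counter()
    for cell in cells:
        for part in cell.split(" · "):
            if part == "…":
                continue  # elided counts are unknown, so they are skipped
            agent, count = part.rsplit(" ", 1)
            totals[agent] += int(count)
    return totals


totals = tally_agents(AGENT_CELLS)
top_agent, top_count = totals.most_common(1)[0]
share = top_count / sum(totals.values())
print(f"{top_agent}: {share:.0%} of {sum(totals.values())} visible commits")
```

On the visible counts this puts claude at roughly 63% across 8 toolchains, in line with the figure quoted earlier. Because the elided entries are missing, treat it as an approximation rather than a reproduction of the published aggregate.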
The fine print (it matters)
A mismatch is not an accusation. It does not mean the code is wrong, or that anyone lied. It means one thing only: a commit's subject claimed something its own diff doesn't show. A real fix to the wrong bug passes the check; an honest doc cleanup with a sloppy subject can flag. A message-vs-diff mismatch is never a correctness, honesty, or intent grade — only a note that a commit's words and its own diff disagree.
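A minimal sketch of what "the subject claims something the diff doesn't show" can look like in practice. Everything in it is an assumption made for illustration: the subject prefixes, the path rules, and the names `classify_subject` and `claim_backed` are hypothetical and are not the rules dos-kernel applies.

```python
import re


def is_test(path):
    # Hypothetical rule: any path mentioning "test" counts as test code.
    return "test" in path.lower()


def is_doc(path):
    # Hypothetical rule: Markdown/reST files or anything under docs/.
    p = path.lower()
    return p.endswith((".md", ".rst")) or p.startswith("docs/")


def classify_subject(subject):
    """Return the kind of work a commit subject claims, or None."""
    s = subject.lower()
    if re.match(r"^(fix|fixes|fixed|bugfix)\b", s):
        return "fix"
    if re.match(r"^tests?\b", s):
        return "test"
    if re.match(r"^docs?\b", s):
        return "docs"
    return None  # nothing checkable was claimed


def claim_backed(subject, changed_paths):
    """True/False if the claim is checkable, None if there is no claim."""
    kind = classify_subject(subject)
    if kind is None:
        return None
    if kind == "test":
        return any(is_test(p) for p in changed_paths)
    if kind == "docs":
        return any(is_doc(p) for p in changed_paths)
    # A "fix" needs at least one changed file that is neither test nor doc.
    return any(not is_test(p) and not is_doc(p) for p in changed_paths)


print(claim_backed("fix: handle empty input", ["src/parser.py"]))
print(claim_backed("fix: retry logic", []))
print(claim_backed("fix typo in README", ["README.md"]))
```

The third call is the "honest doc cleanup with a sloppy subject" case: the change is harmless, but the subject says "fix" and the diff touches only a doc file, so under these toy rules it flags. Note that nothing here looks at whether the code is correct, which is why a flag says nothing about correctness or intent.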
- How it works — exactly what the check reads, what it skips, and every time the check itself was wrong (we narrow the check, never trust the subject).
- The big picture — the population mismatch rate across public repos, with every flag hand-checked and denominators everywhere.
- The live roll-up — the published set above, folded into one aggregate by scripts/scoreboard_rollup.py. Every number is derived from the committed per-repo data, reproducible offline.
- Want your repo listed? Clean or not, it's opt-in and you see the result before it publishes. See the methodology's registration section.
The pages above are the 19 repos we've audited and named. A repo is named only when its verdict is published; a non-clean or unadjudicated verdict is reported only as a count, never as a named page (docs/311 §2).
The kernel is the part that doesn't believe the agents.