Leaderboard
v3 · 49 tasks · 10 submissions
| Rank | Model | Score | Passed | | |
|---|---|---|---|---|---|
| 1 | | 0.51 ± 0.08 | 35 / 49 | 108 | 4.7M / 30k |
| 2 | | 0.39 ± 0.08 | 27 / 49 | 76 | 3.6M / 12k |
| 3 | | 0.32 ± 0.08 | 21 / 49 | 52 | 2.0M / 9k |
| 4 | glm-5.2 (open) | 0.18 ± 0.07 | 15 / 49 | 75 | 4.8M / 34k |
| 5 | kimi-k2.7-code (open) | 0.16 ± 0.07 | 11 / 49 | 104 | 7.7M / 34k |
| 6 | kimi-k2.6 (open) | 0.14 ± 0.07 | 10 / 49 | 122 | 8.6M / 64k |
| 7 | deepseek-v4-pro (open) | 0.10 ± 0.06 | 8 / 49 | 79 | 5.1M / 32k |
| 8 | glm-5.1 (open) | 0.10 ± 0.06 | 9 / 49 | 103 | 6.5M / 24k |
| 9 | minimax-m2.7 (open) | 0.03 ± 0.04 | 4 / 49 | 87 | 4.2M / 26k |
| 10 | | 0.00 ± 0.03 | 0 / 49 | 91 | 3.3M / 12k |
Preliminary trial
Claude Fable 5 · not ranked
Single-pass run, ended early when model access was suspended
35.1% (13/37 CVEs fixed)
Why it's not on the board: only 37 of 49 tasks finished before Anthropic suspended access to Fable 5 on June 12, 2026 under a US export-control directive, and each ran a single attempt rather than the repeated trials behind every ranked score. That makes this a partial, higher-uncertainty figure — directional, not a ranked result — so we report it on its own.
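To make "higher-uncertainty" concrete, here is a minimal sketch that recomputes the 35.1% figure from 13/37 and puts a 95% Wilson score interval around it. The Wilson interval is our own illustrative choice; this page does not say how the ± values on the ranked rows are computed, so the two are not directly comparable. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Figures from the preliminary trial: 13 of 37 finished tasks fixed.
rate = 13 / 37
low, high = wilson_interval(13, 37)
print(f"pass rate {rate:.1%}, 95% Wilson interval {low:.1%} to {high:.1%}")
```

With only 37 single-attempt tasks, the interval spans a wide range of the board, which is consistent with treating the number as directional only.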
Per-trial grid
pass · fcv · fail · hardest first
Efficiency
Score vs. effort — no dollar cost. More turns or tokens don't reliably buy a higher pass rate.
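One way to check this claim against the table is a rank correlation between score and effort. The sketch below assumes the unlabeled fifth column of the leaderboard is a per-model effort count (for example, turns); that reading is an assumption, not something the page states. The helpers `average_ranks` and `spearman` are hypothetical names written for this illustration.

```python
import math


def average_ranks(values):
    """Rank values ascending from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Rows 1 to 10 of the leaderboard: score, and the assumed effort column.
scores = [0.51, 0.39, 0.32, 0.18, 0.16, 0.14, 0.10, 0.10, 0.03, 0.00]
effort = [108, 76, 52, 75, 104, 122, 79, 103, 87, 91]

rho = spearman(scores, effort)
print(f"Spearman rho between score and effort: {rho:.2f}")
```

A value of rho near zero means that ranking models by effort says little about their ranking by score. With ten rows this is a rough check rather than a statistical test.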