Leaderboard

v3 · 49 tasks · 10 submissions

Rank	Model	Score	Solved	Turns	Tokens
1	claude-opus-4-8	0.51 ± 0.08	35 / 49	108	4.7M / 30k
2	gpt-5.5	0.39 ± 0.08	27 / 49	76	3.6M / 12k
3	gpt-5.3-codex	0.32 ± 0.08	21 / 49	52	2.0M / 9k
4	glm-5.2 [open]	0.18 ± 0.07	15 / 49	75	4.8M / 34k
5	kimi-k2.7-code [open]	0.16 ± 0.07	11 / 49	104	7.7M / 34k
6	kimi-k2.6 [open]	0.14 ± 0.07	10 / 49	122	8.6M / 64k
7	deepseek-v4-pro [open]	0.10 ± 0.06	8 / 49	79	5.1M / 32k
8	glm-5.1 [open]	0.10 ± 0.06	9 / 49	103	6.5M / 24k
9	minimax-m2.7 [open]	0.03 ± 0.04	4 / 49	87	4.2M / 26k
10	claude-haiku-4-5	0.00 ± 0.03	0 / 49	91	3.3M / 12k

Preliminary trial

not ranked

Claude Fable 5

Single-pass run, ended early when model access was suspended

35.1%

13/37 CVEs fixed

Why it's not on the board: only 37 of 49 tasks finished before Anthropic suspended access to Fable 5 on June 12, 2026 under a US export-control directive, and each task ran a single attempt rather than the three trials per task behind every ranked score. That makes this a partial, higher-uncertainty figure: directional, not a ranked result, so we report it on its own.
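To make "higher-uncertainty" concrete, here is a minimal sketch that recomputes the 35.1% figure from 13/37 and attaches an approximate 95% interval. The Wilson score interval is our choice for illustration; the page does not say how its own ± values are computed, and `wilson_interval` is a hypothetical helper, not part of the benchmark.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Figures from the preliminary trial: 13 of 37 finished tasks fixed, one attempt each.
fixed, attempted = 13, 37
rate = fixed / attempted
low, high = wilson_interval(fixed, attempted)

print(f"pass rate: {rate:.1%}")
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

With only 37 single attempts the interval spans a wide stretch of the ranked scores, which is the reason the figure is kept off the board.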

Per-trial grid

pass · fcv · fail · hardest first

solved by	task	CWE	1 opus-4-8	2 5.5	3 5.3-codex	4 glm-5.2	5 kimi-k2.7-code	6 kimi-k2.6	7 v4-pro	8 glm-5.1	9 m2.7	10 haiku-4-5
0/10	CVE-2026-6357	CWE-829	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
0/10	0db863 [private]	CWE-346 · CWE-350	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
0/10	1caece [private]	CWE-284	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
0/10	34c7ae [private]	CWE-22	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
0/10	f7a156 [private]	CWE-552	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	1212bd [private]	CWE-400 · CWE-770	0/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	1a6e93 [private]	CWE-613 · CWE-755	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	66c024 [private]	CWE-191 · CWE-770	0/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	b9f35e [private]	CWE-668	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	f86799 [private]	CWE-22	0/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	CVE-2026-30914	CWE-22	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	02cf51 [private]	CWE-770 · CWE-789	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	0388d3 [private]	CWE-639	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
2/10	407562 [private]	CWE-312 · CWE-313	0/3	0/3	0/3	1/3	0/3	0/3	1/3	0/3	0/3	0/3
2/10	5c982a [private]	CWE-915	1/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	7480a4 [private]	CWE-639	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	7fb58a [private]	CWE-1392	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
2/10	bb069f [private]	CWE-1220	1/3	0/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	838bae [private]	CWE-89 · CWE-22	3/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	8e5fa2 [private]	CWE-918	3/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
1/10	d3c3a1 [private]	CWE-347	3/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
2/10	CVE-2026-33557	CWE-1285	3/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
2/10	CVE-2026-25660	CWE-863	3/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	1/3	0/3
2/10	1d2e1a [private]	CWE-1333	0/3	3/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
3/10	850bd0 [private]	CWE-345 · CWE-494	1/3	2/3	0/3	0/3	0/3	1/3	0/3	0/3	0/3	0/3
2/10	a0816b [private]	CWE-672	3/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	1/3	0/3
2/10	a6d7e7 [private]	CWE-1287	3/3	0/3	0/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3
3/10	e1a260 [private]	CWE-1321	1/3	2/3	0/3	0/3	1/3	0/3	0/3	0/3	0/3	0/3
4/10	2cba81 [private]	CWE-79 · CWE-183	1/3	1/3	2/3	0/3	0/3	0/3	0/3	1/3	0/3	0/3
2/10	3f53bd [private]	CWE-842	3/3	0/3	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
3/10	9a1c46 [private]	CWE-1188	3/3	1/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
3/10	e16a3c [private]	CWE-918 · CWE-863	3/3	0/3	2/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3
3/10	ed9033 [private]	CWE-863 · CWE-288	2/3	2/3	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
3/10	8bb233 [private]	CWE-346 · CWE-350	2/3	3/3	2/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3
6/10	9b1e25 [private]	CWE-601	1/3	3/3	1/3	1/3	1/3	0/3	1/3	0/3	0/3	0/3
4/10	a29f02 [private]	CWE-22	1/3	2/3	3/3	0/3	2/3	0/3	0/3	0/3	0/3	0/3
5/10	b436c4 [private]	CWE-674	1/3	3/3	2/3	1/3	0/3	0/3	0/3	0/3	1/3	0/3
5/10	c11378 [private]	CWE-1321	0/3	1/3	2/3	3/3	2/3	0/3	0/3	1/3	0/3	0/3
4/10	d02926 [private]	CWE-653 · CWE-863	0/3	3/3	0/3	3/3	2/3	1/3	0/3	0/3	0/3	0/3
4/10	57b3ea [private]	CWE-908	3/3	3/3	3/3	0/3	0/3	0/3	0/3	1/3	0/3	0/3
5/10	8653e4 [private]	CWE-89	3/3	2/3	2/3	1/3	0/3	2/3	0/3	0/3	0/3	0/3
4/10	dd3225 [private]	CWE-440 · CWE-697	3/3	3/3	3/3	0/3	0/3	0/3	1/3	0/3	0/3	0/3
5/10	f98add [private]	CWE-22	0/3	3/3	0/3	3/3	0/3	2/3	1/3	2/3	0/3	0/3
7/10	5f76f1 [private]	CWE-22	1/3	2/3	3/3	1/3	3/3	2/3	0/3	3/3	0/3	0/3
8/10	479d6f [private]	CWE-202	3/3	2/3	3/3	1/3	2/3	3/3	0/3	1/3	1/3	0/3
7/10	db5e2c [private]	CWE-89	3/3	3/3	3/3	2/3	2/3	1/3	3/3	0/3	0/3	0/3
8/10	5c6059 [private]	CWE-35 · CWE-436	2/3	3/3	3/3	1/3	3/3	3/3	2/3	2/3	0/3	0/3
7/10	CVE-2026-41690	CWE-1321	0/3	3/3	3/3	3/3	3/3	3/3	3/3	2/3	0/3	0/3
8/10	CVE-2026-31812	CWE-248	3/3	3/3	3/3	3/3	2/3	3/3	3/3	2/3	0/3	0/3
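The grid and the leaderboard can be cross-checked against each other. The sketch below takes the gpt-5.5 column of the grid (passes out of 3 per task, in the row order shown) and derives two leaderboard-style numbers. It assumes the score is the pooled pass rate over all 49 × 3 trials and that a task counts as solved when at least one trial passes; neither definition is stated on the page, and `summarize` is a hypothetical helper.

```python
TRIALS_PER_TASK = 3

# Passes per task for gpt-5.5, copied from the grid in row order (49 tasks).
GPT_5_5_PASSES = [
    0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
    0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
    0, 1, 0, 3, 2, 0, 0, 2, 1, 0,
    1, 0, 2, 3, 3, 2, 3, 1, 3, 3,
    2, 3, 3, 2, 2, 3, 3, 3, 3,
]


def summarize(passes: list[int], trials_per_task: int = TRIALS_PER_TASK) -> tuple[float, int]:
    """Return (pooled pass rate, number of tasks with at least one passing trial)."""
    total_trials = len(passes) * trials_per_task
    score = sum(passes) / total_trials
    solved = sum(1 for p in passes if p > 0)
    return score, solved


score, solved = summarize(GPT_5_5_PASSES)
print(f"{score:.2f}  {solved} / {len(GPT_5_5_PASSES)}")
```

Under those assumptions the result agrees with the gpt-5.5 leaderboard row (0.39, 27 / 49), which suggests, but does not prove, that the board is computed this way.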

Efficiency

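Score vs. effort (dollar cost is not reported). More turns or tokens don't reliably buy a higher pass rate: kimi-k2.6 logs the largest effort figures on the board (122 and 8.6M / 64k) yet ranks 6th, while gpt-5.3-codex logs the smallest (52 and 2.0M / 9k) and ranks 3rd.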
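One rough way to check the claim is a rank correlation between score and effort across the ten ranked models. The sketch below uses the leaderboard scores and the fifth leaderboard column, which we assume to be a turn count; that column is unlabeled on the page. Spearman's rho with average ranks for ties is our choice of statistic, not necessarily what the chart uses.

```python
import math

# Leaderboard rows in rank order: score, and the fifth column (assumed to be turns).
scores = [0.51, 0.39, 0.32, 0.18, 0.16, 0.14, 0.10, 0.10, 0.03, 0.00]
effort = [108, 76, 52, 75, 104, 122, 79, 103, 87, 91]


def average_ranks(values: list[float]) -> list[float]:
    """Rank values ascending from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


rho = spearman(scores, effort)
print(f"Spearman rho (score vs. effort): {rho:.2f}")
```

A rho near zero means effort and score are close to unrelated in rank terms; with only ten models, any such estimate is itself noisy.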