CVEGym

CVEGym is a benchmark for real vulnerability repair. Models receive vulnerable open-source repositories and must submit patches that actually close disclosed CVEs.

Every task is a real, recently disclosed CVE in open-source code. A patch counts only when a held-out exploit confirms the vulnerability is closed and the project's own tests still pass. CVEGym is run and maintained by DeepSource.


Model              Pass@1     Solved
claude-opus-4-8    51% ±8%    35/49
gpt-5.5            39% ±8%    27/49
gpt-5.3-codex      32% ±7%    21/49
glm-5.2            18% ±6%    15/49
kimi-k2.7-code     16% ±6%    11/49
kimi-k2.6          14% ±6%    10/49
deepseek-v4-pro    10% ±5%    8/49
glm-5.1            10% ±5%    9/49
minimax-m2.7       3% ±3%     4/49
claude-haiku-4-5   0% ±1%     0/49
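
The Pass@1 and Solved figures need not agree: for claude-opus-4-8, 35/49 is about 71%, while Pass@1 is 51%. One reading, assumed here and not confirmed by this page, is that each task is attempted in several independent trials, that Pass@1 averages the per-trial success rate, and that Solved counts tasks with at least one successful trial. The sketch below illustrates that reading on made-up toy data; the function names and the task IDs are hypothetical.

```python
# Toy illustration only: the trial outcomes below are invented, and the
# definitions of Pass@1 and Solved are assumptions, not CVEGym's documented method.

def pass_at_1(trials: dict[str, list[bool]]) -> float:
    """Average, over tasks, of the fraction of trials that succeeded."""
    per_task = [sum(outcomes) / len(outcomes) for outcomes in trials.values()]
    return sum(per_task) / len(per_task)


def solved_count(trials: dict[str, list[bool]]) -> int:
    """Number of tasks with at least one successful trial."""
    return sum(any(outcomes) for outcomes in trials.values())


# Four hypothetical tasks, three trials each.
TOY_TRIALS = {
    "task-a": [True, True, True],
    "task-b": [True, False, False],
    "task-c": [False, False, True],
    "task-d": [False, False, False],
}

if __name__ == "__main__":
    print(f"Pass@1: {pass_at_1(TOY_TRIALS):.0%}")
    print(f"Solved: {solved_count(TOY_TRIALS)}/{len(TOY_TRIALS)}")
```

Under these assumptions, a model that solves a task only occasionally raises its Solved count by a full task but its Pass@1 by only a fraction, which would explain a gap between the two columns.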

CVEGym is purpose-built to measure whether frontier models can close a real vulnerability. That goal shapes the benchmark design along four axes:

Real vulnerabilities: Every task is a real, recently disclosed CVE in real open-source code.
Behavioral oracles: A patch passes only if a held-out security test passes and the project's own tests stay green. We grade the fix, not the diff.
Contamination-resistant: The task set combines recent CVEs with ported tasks, which transplant a bug's shape into a different repo and are never merged upstream. Only 6 tasks in the benchmark are publicly disclosed; all others are kept private.
Audited for integrity: We measure FCV and reward hacking to detect and penalize patches that pass the tests without actually closing the vulnerability.

Together, these axes keep the pass rate honest. A trial counts only when the model actually closed the CVE, on code it likely hasn't seen, without breaking the project.
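
The acceptance rule described above can be sketched as a single predicate. This is a minimal illustration, not CVEGym's grader: the `TrialResult` record, its field names, and the choice to treat an audit flag as a hard failure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """Outcome of one patch attempt (hypothetical record, not CVEGym's schema)."""

    exploit_blocked: bool       # held-out security test passes on the patched code
    project_tests_green: bool   # the project's own test suite still passes
    flagged_by_audit: bool      # integrity audit flagged the patch (assumed hard failure)


def trial_counts(result: TrialResult) -> bool:
    """Return True only if the patch closes the CVE without breaking the project.

    The check looks only at behavior (test outcomes and the audit flag);
    it never compares the submitted diff against a reference patch.
    """
    return (
        result.exploit_blocked
        and result.project_tests_green
        and not result.flagged_by_audit
    )


if __name__ == "__main__":
    examples = {
        "clean fix": TrialResult(True, True, False),
        "breaks project tests": TrialResult(True, False, False),
        "exploit still works": TrialResult(False, True, False),
        "flagged by audit": TrialResult(True, True, True),
    }
    for name, result in examples.items():
        print(f"{name}: {'counts' if trial_counts(result) else 'does not count'}")
```

Requiring every condition at once means a patch that silences the exploit by disabling a feature, and so breaks the project's tests, does not count.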

Public tasks

All 49 tasks