CVEGym

CVEGym is a benchmark for real vulnerability repair. Models receive vulnerable open-source repositories and must submit patches that actually close disclosed CVEs.

Every task is a real, recently disclosed CVE in open-source code. A patch counts only when a held-out exploit confirms the vulnerability is closed and the project's own tests still pass. CVEGym is run and maintained by DeepSource.

  1. 51% ±8% · solved 35/49
  2. 39% ±8% · solved 27/49
  3. 32% ±7% · solved 21/49
  4. 18% ±6% · solved 15/49
  5. 16% ±6% · solved 11/49
  6. 14% ±6% · solved 10/49
  7. 10% ±5% · solved 8/49
  8. 10% ±5% · solved 9/49
  9. 3% ±3% · solved 4/49
  10. 0% ±1% · solved 0/49

CVEGym is purpose-built to measure whether frontier models can close a real vulnerability, which demands a different benchmark design across four axes:

  • Real vulnerabilities: Every task is a real, recently disclosed CVE planted in real open-source code.
  • Behavioral oracles: A patch passes only if a held-out security test passes and the project's own tests stay green. We grade the fix, not the diff.
  • Contamination-resistant: The task set combines recent CVEs with ported tasks, which transplant a bug's shape into a different repo and are never merged upstream. Only 6 tasks in the benchmark are publicly disclosed; all others are kept private.
  • Audited for integrity: We measure FCV and reward-hacking to detect and penalize patches that pass the tests without actually closing the vulnerability.

Together, these axes keep the pass rate honest. A trial counts only when the model actually closed the CVE, on code it likely hasn't seen, without breaking the project.
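
The pass condition described above can be read as a simple conjunction: the held-out exploit must be blocked and the project's own tests must stay green. The following is a minimal sketch of that rule, not CVEGym's actual harness; the `TrialResult` type, its field names, and the `pass_rate` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    """Outcome of one attempt on one task (hypothetical structure)."""
    exploit_blocked: bool      # held-out security test passed on the patched code
    project_tests_green: bool  # the project's own test suite still passes


def trial_counts(result: TrialResult) -> bool:
    """A trial counts only if the exploit is blocked AND nothing regressed."""
    return result.exploit_blocked and result.project_tests_green


def pass_rate(results: list[TrialResult]) -> float:
    """Fraction of trials that count; 0.0 for an empty list (an assumption)."""
    if not results:
        return 0.0
    return sum(trial_counts(r) for r in results) / len(results)


if __name__ == "__main__":
    trials = [
        TrialResult(exploit_blocked=True, project_tests_green=True),   # counts
        TrialResult(exploit_blocked=True, project_tests_green=False),  # broke the project
        TrialResult(exploit_blocked=False, project_tests_green=True),  # CVE still open
    ]
    print(f"{pass_rate(trials):.2f}")
```

Under this reading, a patch that blocks the exploit by disabling the affected feature fails whenever the project's tests cover that feature, which is the point of grading behavior rather than the diff.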

Public tasks

All 49 tasks