CVEGym
CVEGym is a benchmark for real vulnerability repair. Models receive vulnerable open-source repositories and must submit patches that actually close disclosed CVEs.
Every task is a real, recently disclosed CVE in open-source code. A patch counts only when a held-out exploit confirms the vulnerability is closed and the project's own tests still pass. CVEGym is run and maintained by DeepSource.
- claude-opus-4-8: 51% ±8% · solved 35/49
- gpt-5.5: 39% ±8% · solved 27/49
- gpt-5.3-codex: 32% ±7% · solved 21/49
- glm-5.2: 18% ±6% · solved 15/49
- kimi-k2.7-code: 16% ±6% · solved 11/49
- kimi-k2.6: 14% ±6% · solved 10/49
- deepseek-v4-pro: 10% ±5% · solved 8/49
- glm-5.1: 10% ±5% · solved 9/49
- minimax-m2.7: 3% ±3% · solved 4/49
- claude-haiku-4-5: 0% ±1% · solved 0/49
CVEGym is purpose-built to measure whether frontier models can close a real vulnerability, which demands a different benchmark design across four axes:
- Real vulnerabilities: Every task is a real, recently disclosed CVE planted in real open-source code.
- Behavioral oracles: A patch passes only if a held-out security test passes and the project's own tests stay green. We grade the fix, not the diff.
- Contamination-resistant: The task set combines recent CVEs with ported tasks, which transplant a bug's shape into a different repo and are never merged upstream. Only 6 tasks in the benchmark are publicly disclosed; all others are kept private.
- Audited for integrity: We measure FCV and reward-hacking to detect and penalize patches that pass the tests without actually closing the vulnerability.
Together, these axes keep the pass rate honest. A trial counts only when the model actually closed the CVE, on code it likely hasn't seen, without breaking the project.
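The grading rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not CVEGym's actual harness: the `TrialResult` structure, its field names, and the `pass_rate` helper are hypothetical, and the sketch assumes the pass rate is simply the fraction of trials that satisfy both conditions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TrialResult:
    """Outcome of one attempt at one task (hypothetical structure)."""
    task_id: str
    exploit_blocked: bool      # the held-out security test passed on the patched code
    project_tests_green: bool  # the project's own test suite still passes


def trial_passes(result: TrialResult) -> bool:
    """A trial counts only if the exploit is blocked AND nothing else broke."""
    return result.exploit_blocked and result.project_tests_green


def pass_rate(results: Sequence[TrialResult]) -> float:
    """Fraction of trials that pass (assumed definition); 0.0 for no trials."""
    if not results:
        return 0.0
    return sum(trial_passes(r) for r in results) / len(results)


if __name__ == "__main__":
    trials = [
        TrialResult("task-a", exploit_blocked=True, project_tests_green=True),
        TrialResult("task-b", exploit_blocked=True, project_tests_green=False),
        TrialResult("task-c", exploit_blocked=False, project_tests_green=True),
        TrialResult("task-d", exploit_blocked=True, project_tests_green=True),
    ]
    print(f"{pass_rate(trials):.0%}")  # prints "50%"
```

Note that the diff itself never appears in the rule: a patch that blocks the exploit but breaks the project's tests (task-b) fails just like one that leaves the exploit working (task-c).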
Public tasks
- CVE-2026-30914 (high, CWE-22): Path traversal via backslash normalization discrepancy across FTP/SFTP handlers and VFS backends allows directory escape.
- CVE-2026-33557 (critical, CWE-1285): Apache Kafka does not validate JWT tokens in its OAUTHBEARER authentication implementation.
- CVE-2026-41690 (high, CWE-1321): Port of i18next-http-middleware CVE-2026-41690 prototype-pollution-via-dotted-segment into mashpie/i18n-node. The i18n-node library already hosts an analogous dotted-key traversal in localeMutator / localeAccessor (i18n.js): `singular.split(objectNotation).reduce((object, index) => ... object[index] = value, locales[locale])`. The vulnerable port weakens the per-segment existence check from Object.prototype.hasOwnProperty.call to a plain `object[index] === undefined`, matching the shape of i18next-http-middleware's pre-3.9.3 setPath. Attacker-controlled dotted keys reaching the public `__` / `__n` API then walk through inherited Object members and write to the shared prototype: leading (`__proto__.polluted`), non-leading (`foo.__proto__.x`, the v3.9.7-style follow-up bypass), and deeper constructor.prototype chains all reach Object.prototype. The fix re-introduces the segment guard at every position via a small lib/safe-set.js helper referenced from both traversal functions.
- CVE-2026-25660 (critical, CWE-863): Authentication bypass in CodeChecker permission system grants unauthenticated users full access when authentication is enabled.
- CVE-2026-6357 (high, CWE-829): pip Vulnerable to Inclusion of Functionality from Untrusted Control Sphere.
- CVE-2026-31812 (high, CWE-248): quinn-proto transport parameter parsing panics on truncated varints via bare .unwrap() calls, enabling unauthenticated remote DoS with a single UDP packet.