CVEGym v1.0 is live
OSS-Fuzz finds the bug. CVEGym measures who can fix it.
The first CVEGym leaderboard is up: 10 frontier models against 49 real, recently-disclosed CVEs, 3 attempts each, for 1,470 graded patch attempts in total. Every task hands the model a vulnerable codebase and asks for the fix. A patch only counts when it does two things at once: it closes the vulnerability and leaves the rest of the project working.
The headline: patching real CVEs is hard
The top model lands a correct patch on fewer than half its attempts, and the field falls off fast from there. Across the whole panel, 44 of the 49 tasks were solved by at least one model, and 5 were solved by nobody. These are recent real-world vulnerabilities, and the distance between writing plausible code and shipping the actual fix is the whole point of the benchmark.
Two patterns are worth reading off the board. First, pass rate and tasks-solved disagree: the model that tops the pass-rate ranking isn't the one that solves the most distinct tasks, because one model gets more CVEs right at least once but does it less consistently across its three tries. Both numbers are on the board and they measure different things. Second, effort spans an order of magnitude. Two models can post nearly the same pass rate while differing several-fold in the turns and tokens they spend getting there, so the board reports effort as turns and tokens per trial alongside the score.
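To make the first pattern concrete, here is a minimal sketch using made-up outcomes for two hypothetical models on four tasks. The data and the helper names `pass_rate` and `tasks_solved` are illustrative assumptions, not CVEGym's code or results.

```python
from typing import Dict, List

# Hypothetical per-task outcomes: three attempts each, True = correct patch.
Outcomes = Dict[str, List[bool]]

def pass_rate(outcomes: Outcomes) -> float:
    """Fraction of all attempts that produced a correct patch."""
    attempts = [ok for tries in outcomes.values() for ok in tries]
    return sum(attempts) / len(attempts)

def tasks_solved(outcomes: Outcomes) -> int:
    """Number of tasks with at least one correct attempt."""
    return sum(1 for tries in outcomes.values() if any(tries))

# Made-up data, for illustration only.
consistent = {
    "task-a": [True, True, True],
    "task-b": [True, True, True],
    "task-c": [False, False, False],
    "task-d": [False, False, False],
}
broad = {
    "task-a": [True, False, False],
    "task-b": [True, True, False],
    "task-c": [False, True, False],
    "task-d": [False, False, False],
}

for name, outcomes in (("consistent", consistent), ("broad", broad)):
    print(name, round(pass_rate(outcomes), 3), tasks_solved(outcomes))
```

In this toy data the `consistent` model has the higher pass rate while the `broad` model solves more distinct tasks, which is the kind of disagreement the board shows.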
How a patch is graded
There are two oracles, and both have to pass.
- The security test asks whether the patched code now rejects the exploit. For native tasks that's the regression test that shipped with the upstream fix; for ported tasks it's a hand-written structural checker.
- The functional suite asks whether the patch left unrelated behaviour intact. This is the hard gate against "fixes" that close the CVE by quietly breaking the feature around it.
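The two-oracle gate can be sketched in a few lines. The `OracleResult` type and the `patch_counts` function are hypothetical names chosen for illustration; the actual grader is not shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OracleResult:
    security_passed: bool    # the patched code now rejects the exploit
    functional_passed: bool  # unrelated behaviour is still intact

def patch_counts(result: OracleResult) -> bool:
    """A patch counts only when both oracles pass."""
    return result.security_passed and result.functional_passed

# A "fix" that closes the CVE but breaks the surrounding feature does not count.
print(patch_counts(OracleResult(security_passed=True, functional_passed=False)))
```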
We rank by Pass@1 with a 95% confidence interval. A structural-quality score (how surgical the diff is) only breaks ties; it never drives the ranking. Full detail is on the methodology page.
Keeping it honest
A patch benchmark is only meaningful if the model hasn't already memorised the fix, so we lean on two defenses and ask nothing of the providers.
- Recency. A CVE is admitted only within 30 days of its advisory, which puts it after most models' training cutoffs.
- Porting. Some tasks transplant a real bug's shape into different public code, so a memorised advisory doesn't carry over. The board mixes native tasks (the bug in its original repo) with ported ones.
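The recency rule amounts to a date check. This is a minimal sketch: the function name, the choice of reference date, and the inclusive 30-day boundary are assumptions made for illustration.

```python
from datetime import date, timedelta

ADMISSION_WINDOW = timedelta(days=30)  # the 30-day window from the recency rule

def is_admissible(advisory_date: date, considered_on: date) -> bool:
    """True if the CVE is considered within 30 days of its advisory (inclusive)."""
    age = considered_on - advisory_date
    return timedelta(0) <= age <= ADMISSION_WINDOW

print(is_admissible(date(2026, 3, 1), date(2026, 3, 20)))  # 19 days after the advisory
print(is_admissible(date(2026, 3, 1), date(2026, 5, 1)))   # 61 days after the advisory
```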
Everything runs on public open-source code, with no proprietary repos. We publish the full identity of six public exemplar tasks — prompt, oracle tests, gold patch, and per-model trajectories — so you can see exactly what a CVEGym task looks like:
- go-CVE-2026-30914
- maven-kafka-CVE-2026-33557-jwt-validator
- npm-i18n-node-CVE-2026-41690-proto-pollution
- pip-CVE-2026-25660
- pip-CVE-2026-6357
- rust-quinn-CVE-2026-31812
The rest run identity-withheld. Every model's per-task score is published, but the task shows up under an opaque id with the advisory, repo, and patch redacted, so nobody can reverse-engineer the set and game against it. This is a contamination defense; the code is still public.
How runs work
DeepSource runs every evaluation directly. There's no self-submission path. Each model runs under the agent harness it's built for — frontier models through their vendor-native CLI, open-weight models through a shared open-source agent — and the harness is held fixed per model across the panel, so a model's score reflects the model, not a harness we picked for it. If a row ever has to be struck, whether for a methodology bug, a provider route change, or contamination evidence, it stays on the board at its original rank, struck through, with the reason, and the retraction is posted here.
What's next
This is v1.0 at a single disclosure tier, where the model is told the bug class but not its location or fix. A few things follow from here.
- The disclosure ladder. The same task can be posed five ways, from a bare "there is a vulnerability in this repo" up to naming the class, the attack, and the subsystem. v1.0 reports the class tier; sweeping the full ladder shows how much of a model's score is capability versus how much is the hint, which is the more honest measure of whether it can find and fix a bug on its own.
- More tasks, and fresher ones. The set grows as new CVEs are disclosed. Because admission favors a 30-day window, the board keeps moving toward vulnerabilities that landed after the current models' training cutoffs, which is the cleanest test of repair rather than recall.
- The contamination view. Once per-task disclosure dates and per-model cutoffs are published together, each cell can be read against the odds the fix was in training data. That view stays withheld until both halves are pinned down, so it's a measurement and not a guess.
- New models as they ship. When a lab releases a model, it goes on the board against the same tasks under the same grading, so the generational delta is a like-for-like number rather than a press-release one.
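For the contamination view, the simplest possible reading of a cell is whether the CVE was disclosed after the model's training cutoff. The sketch below is a binary simplification of "the odds the fix was in training data", and the function name and dates are hypothetical. It returns `None` when a cutoff is not pinned down, so the cell stays unlabeled rather than guessed.

```python
from datetime import date
from typing import Optional

def disclosed_after_cutoff(disclosed: date, cutoff: Optional[date]) -> Optional[bool]:
    """True if the CVE was disclosed after the model's training cutoff.

    Returns None when the cutoff is unknown, leaving the cell unlabeled.
    """
    if cutoff is None:
        return None
    return disclosed > cutoff

# Made-up dates, for illustration only.
print(disclosed_after_cutoff(date(2026, 3, 1), date(2025, 12, 31)))
print(disclosed_after_cutoff(date(2026, 3, 1), None))
```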
The takeaway
Writing a plausible patch and closing a real vulnerability are different skills, and the gap between them is exactly what most evaluations miss. CVEGym grades the second one: a patch counts only when the security test confirms the CVE is closed and the functional suite confirms the project still works. On that bar, the best model on the board today lands a correct patch on fewer than half of its attempts, and five of the 49 CVEs resist every model. That distance is the point: it's the room left for AI to actually do this job, measured honestly. Want managed runs for your own models? Get in touch.