Methodology
CVEGym is a defensive patching benchmark. Each task gives an AI coding agent a vulnerable open-source repository and asks for the patch that closes a real CVE without breaking the surrounding project.
The submitted artifact is a source diff. CVEGym applies it to the vulnerable tree and reruns the gates: does it apply, does the security oracle reject the CVE input, and does the existing project behavior still pass? Agent explanations, self-reported test runs, and exploit write-ups do not count.
The public leaderboard reports model-plus-harness performance on this curated repair set. It is not a general coding leaderboard or a provider attestation. The methodology focuses on failure modes that make agent benchmark numbers easy to overread: reward hacking in frontier agents, reward-hackable benchmark infrastructure, stale public tasks, harness effects mistaken for model effects, and tests that miss security-relevant behavior.
What CVEGym measures
CVEGym measures one thing: given the vulnerable tree and a controlled prompt, can the agent produce a patch that both fixes the CVE and keeps the project working?
The leaderboard makes four bounded claims.
- The artifact is a patch. The scored object is a source diff. Nothing rewards weaponizing the bug or building an exploit chain.
- Correctness before style. Pass@1 is the primary metric. Structural quality is visible, but it only matters after the correctness gates pass.
- Model under a representative harness. Frontier models run through their vendor-native CLI (Claude Code, Codex); open-weight models run through a shared mini-swe-agent scaffold. The harness is held fixed per model across the sweep and reported with the row, so a score means model plus harness, with no hidden scaffold change behind a number.
- Contamination-resistant, within limits. The benchmark uses recent disclosures, ported variants, identity withholding, prompt linting, and batch rotation. What it does not do is ask providers to swear a task was absent from training data.
The last claim is the one to be most suspicious of. Security patching has a memorization problem: many public CVEs ship with a patch, a test, a write-up, or all three, and a benchmark that ignores this mostly measures retrieval. The defenses here are structural. You can inspect them, and they are versioned.
Current sweep results
The current public panel is backed by a class-disclosure sweep: 10 models, 49 tasks, and 3 trials per model-task cell at the class disclosure tier. That is 1,470 planned trials, all 1,470 scored, with 284 full passes, for a pooled attempt-level Pass@1 of 19.3%. The strongest model lands roughly half its attempts; the weakest closes almost none. Per-model Pass@1, Wilson intervals, task-solved counts, and gate rates live on the leaderboard.
Two patterns in the sweep matter more than the headline number.
Most of the failure mass is security failure, not project breakage. Of the 1,470 scored trials, 284 passed in full, 1,139 applied but did not close the CVE, and 47 did not apply. Patches almost always apply and almost always keep the existing suite green; what they miss is the security oracle. A patch can look clean and still leave the CVE open.
The gates are also reported separately. The leaderboard shows four independent gate rates — produced a patch, applied cleanly, security test passed, project tests still green — next to Pass@1 and task-solved counts. Collapsing them into a single "model quality" score would hide the one thing a security researcher actually wants to know: did the patch close the vulnerability without breaking the project.
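The sweep arithmetic quoted above can be re-derived in a few lines. This is only a consistency check on the published counts; the variable names are illustrative and not part of CVEGym.

```python
# Figures quoted in the sweep description above.
MODELS, TASKS, TRIALS_PER_CELL = 10, 49, 3
FULL_PASS, APPLIED_BUT_OPEN, DID_NOT_APPLY = 284, 1139, 47

planned = MODELS * TASKS * TRIALS_PER_CELL            # trials planned for the sweep
scored = FULL_PASS + APPLIED_BUT_OPEN + DID_NOT_APPLY  # trials that received a verdict
pooled_pass_at_1 = FULL_PASS / scored                  # attempt-level Pass@1

# Share of all non-passing trials where the patch applied but the CVE stayed open.
security_share_of_failures = APPLIED_BUT_OPEN / (scored - FULL_PASS)

print(planned, scored, f"{pooled_pass_at_1:.1%}")  # prints: 1470 1470 19.3%
```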
Grading
Every trial is scored from the candidate diff. The verifier applies the patch, runs the oracles, and emits a reward vector.
- r_apply records whether the patch applies cleanly.
- r_test_pass is the security-targeted oracle. For native tasks, this is the regression test associated with the upstream fix. For ported tasks, which have no upstream maintainer test at the transplanted site, this is a hand-written structural checker over the patch.
- r_pass_to_pass is the functional-preservation oracle. The project's existing test suite must still pass. A patch that "fixes" the CVE by deleting a feature, weakening validation, or breaking normal behavior is rejected.
A trial's public passed flag requires all three: the patch applies, the security oracle passes, and the functional-preservation oracle passes.
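As a minimal sketch of that rule, the reward vector can be modeled as a small record with a derived flag. The field names follow the text; the `RewardVector` class and `passed` helper are hypothetical names for illustration, not the verifier's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardVector:
    r_apply: int         # 1 if the patch applies cleanly
    r_test_pass: int     # 1 if the security oracle passes
    r_pass_to_pass: int  # 1 if the project's existing suite still passes


def passed(r: RewardVector) -> bool:
    """A trial passes only when all three gates pass."""
    return r.r_apply == 1 and r.r_test_pass == 1 and r.r_pass_to_pass == 1


# A clean-looking patch that keeps the suite green but leaves the CVE open.
still_vulnerable = RewardVector(r_apply=1, r_test_pass=0, r_pass_to_pass=1)
full_remediation = RewardVector(r_apply=1, r_test_pass=1, r_pass_to_pass=1)
```

Here `passed(still_vulnerable)` is false even though two of the three gates are green, which is the case the security oracle exists to catch.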
The case the security oracle exists to catch is the functionally-correct-yet-vulnerable patch (FCV): every existing test stays green, the CVE stays open. It accounts for most of the misses in this sweep. The opposite case, a patch that closes the CVE but breaks the project's own suite, is rejected just as hard, because a remediation benchmark has to measure collateral damage. In practice that case barely shows up — per-model functional-preservation rates stay above 98.9% — so the security gate is where models separate.
Each trial also carries a scalar score from 0 to 1, used for tie-breaking, RL calibration, and error analysis. It never overrides Pass@1.
The scalar is stage-gated.
- If r_apply == 0, scalar is 0.0.
- If measured r_pass_to_pass == 0, scalar is 0.0. Functional breakage gets no partial credit.
- If r_test_pass == 1, scalar is min(1.0, 0.8 + 0.2 * structural_score).
- If r_test_pass == 0 and an oracle exists, scalar is 0.1 * structural_score. That low band is for near-miss analysis and never counts as a pass.
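The stage gating above can be written as one function. This is a sketch under stated assumptions: the function name and the `oracle_exists` parameter are hypothetical, and returning `None` when no oracle exists is a placeholder, because the text does not say what happens in that case.

```python
from typing import Optional


def stage_gated_scalar(
    r_apply: int,
    r_pass_to_pass: int,
    r_test_pass: int,
    structural_score: float,
    oracle_exists: bool = True,
) -> Optional[float]:
    """Scalar in [0, 1] following the four gating rules listed above."""
    if r_apply == 0:
        return 0.0  # patch did not apply
    if r_pass_to_pass == 0:
        return 0.0  # functional breakage gets no partial credit
    if r_test_pass == 1:
        return min(1.0, 0.8 + 0.2 * structural_score)  # full pass: 0.8 to 1.0
    if oracle_exists:
        return 0.1 * structural_score  # security near miss: 0.0 to 0.1
    return None  # assumption: the source does not specify this case
```

With these rules a full pass never scores below 0.8 and a security miss never scores above 0.1, so the scalar cannot reorder a pass beneath a miss.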
The structural score uses three shaping signals measured against the gold patch shape.
- locality (weight 0.50): how concentrated the diff is against the gold patch footprint.
- minimality (0.3125): meaningful source-line count relative to gold.
- cyclomatic-complexity delta (0.1875): control-flow complexity of the patched files vs gold.
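One plausible way to combine the three signals is a weighted sum. The weights below are the ones listed above; everything else is an assumption for illustration: that each signal is already normalized to [0, 1] with higher meaning closer to the gold patch shape, that the combination is linear, and the function and key names.

```python
from typing import Mapping

# Weights as listed above; they sum to exactly 1.0.
WEIGHTS = {
    "locality": 0.50,
    "minimality": 0.3125,
    "complexity_delta": 0.1875,
}


def structural_score(signals: Mapping[str, float]) -> float:
    """Weighted sum of per-signal scores (assumed normalized to [0, 1])."""
    for name in WEIGHTS:
        value = signals[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Hypothetical patch: well localized, somewhat larger than gold, same complexity.
example = structural_score(
    {"locality": 1.0, "minimality": 0.5, "complexity_delta": 1.0}
)
```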
The eval data also keeps r_test_pass_ratio as a near-miss signal for RL, and tool-call efficiency as process telemetry. Neither feeds the public scalar.
[Figure: scoring pipeline. Gates in order: (1) r_apply, can the candidate diff be applied cleanly? Failure gives scalar 0. (2) r_pass_to_pass, does the project's existing test suite still pass? Failure gives scalar 0. (3) r_test_pass, does the security oracle now reject the CVE input? Failure is not a Pass@1 and earns only the low partial scalar. Structural weights: locality 0.5 (diff footprint vs gold), minimality 0.3125 (source lines vs gold), complexity delta 0.1875 (control-flow change).]
The scalar is 0 on failed apply or functional breakage, 0.1 * structural on a security miss, and min(1, 0.8 + 0.2 * structural) on a full pass. The leaderboard sorts by Pass@1; scalar is secondary.
The preservation gate (r_pass_to_pass) is the load-bearing one: closing the CVE while breaking unrelated functionality scores zero no matter how surgical the diff. A security near miss can keep a small scalar for analysis, but Pass@1 only counts full remediation.
The leaderboard sorts by Pass@1 (the share of trials where apply, security, and functional-preservation checks all pass), shown with a 95% Wilson confidence interval. The composite scalar is a secondary column.
A Pass@1 of 0.000 is a real result, not a missing value: the trials produced no full passes. Process failures, where the verifier never produced a reward, are reported separately rather than silently converted into model zeros.
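For reference, the Wilson score interval can be computed as below. This is the standard formula with z = 1.96 for a 95% interval, not CVEGym's own code. The zero-pass example assumes one model's 49 tasks times 3 trials, 147 attempts.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


pooled_low, pooled_high = wilson_interval(284, 1470)  # pooled sweep counts
zero_low, zero_high = wilson_interval(0, 147)         # a model with no full passes
```

Even with zero passes the upper bound stays above zero, which is one reason a 0.000 row still carries information rather than standing in for a missing value.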
Task construction
CVEGym sources candidates from OSV, GHSA, distro advisories, and OSS-Fuzz-adjacent disclosures. Admission is deliberately two-stage.
First, an automated filter scores each candidate for recency, likely memorization risk, textbook-CWE shape, advisory leakage, ecosystem fit, and whether the bug has a compensating signal such as cross-file reasoning, a non-grepable root cause, or a fix that the advisory does not imply. Second, a maintainer reviews the survivors and admits only tasks whose oracle can be made reliable.
The filter narrows the pool; it does not auto-admit. A task can trip one flag and still belong in the panel when the compensating signal is strong. A recent path-traversal issue caused by a backslash-normalization discrepancy, for example, is not the same evaluation object as a textbook "join path, forget to sanitize" CWE-22 exercise. The panel is curated for difficulty, oracle quality, and contamination resistance rather than raw count.
Each task carries:
- A split tag (almost every task is train today; val is reserved for a held-out evaluation set as the corpus grows).
- A complexity_score from 1 to 10, with free-text complexity notes.
- A canonical reproducer, a gold-patch reference, and an oracle audit trail.
- A privacy status that decides whether the advisory identity is public or redacted.
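A task record with these fields might look like the sketch below. Only `split` and `complexity_score` are names taken from the text; the class name and the remaining fields are hypothetical, and the reproducer, gold-patch, and audit-trail references are left out for brevity.

```python
from dataclasses import dataclass

VALID_SPLITS = {"train", "val"}


@dataclass(frozen=True)
class TaskRecord:
    task_id: str           # hypothetical: opaque id for identity-withheld tasks
    split: str             # "train" or "val"
    complexity_score: int  # 1 to 10
    complexity_notes: str  # free-text notes
    identity_public: bool  # hypothetical name for the privacy status

    def __post_init__(self) -> None:
        if self.split not in VALID_SPLITS:
            raise ValueError(f"unknown split: {self.split!r}")
        if not 1 <= self.complexity_score <= 10:
            raise ValueError("complexity_score must be between 1 and 10")


example_task = TaskRecord(
    task_id="task-0000",
    split="train",
    complexity_score=7,
    complexity_notes="cross-file root cause",
    identity_public=False,
)
```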
Every task environment is public open-source code. No customer code, no proprietary repos.
The board is small and hand-curated on purpose. Scale-first benchmarks can scrape thousands of advisories; CVEGym trades that volume for deeper validation and oracles built to resist benchmark-specific shortcuts.
Native and ported tasks
Every task is one of two kinds.
- Native tasks are the bug as it shipped upstream, fixed in its original repository. The main defense is freshness, plus identity withholding until model cutoff dates and per-task disclosure dates can be published. The older a native task gets, the stronger its compensating signal has to be to stay in the panel. The v1.0 board has 43 native tasks.
- Ported tasks transplant the shape of a real CVE into a different, analogous function in another public repository. The bug behaves the same way, but the code the model sees is code the CVE was never disclosed in. The vulnerable variant itself is never merged upstream or published anywhere; the base repo is public OSS, but the exact tree under test exists only inside the eval. A model that memorized the original advisory and patch gains little. The v1.0 board has 6 ported tasks.
A ported task has no upstream maintainer test at the transplanted site, so its security oracle is a hand-written structural checker that runs over the submitted diff. The checker has to test the security property without turning into a patch-string lookup.
We withhold task identities, not the code. The benchmark is built entirely on public OSS, but most of the board is private: only 6 of the 49 v1.0 tasks publish their advisory id, repository, and gold patch. The other 43 run identity-withheld: every model's per-task score is still published, but the task appears under an opaque task-<hash> id with the advisory, repo, and patch redacted. This stops a leaderboard reader from reconstructing the private set and optimizing against it.
Contamination threat model
The threat model here is ordinary internet reality, not provider malice. Public vulnerabilities get indexed, discussed, patched, mirrored, and eventually absorbed into training corpora; no bad faith required. That is why the benchmark never leans on a provider's word that a task was excluded from training.
The defense has four layers.
- Recency weighting. The candidate filter favors disclosures inside a 30-day freshness window when possible. Some older tasks remain when porting or another strong compensating signal makes memorization unlikely to help. The current class sweep marks all 49 tasks as training_data_likely=false; the per-task contamination view stays withheld until model cutoff dates and per-task disclosure dates are published.
- Porting. Ported tasks move the vulnerability shape into code where the original CVE was never disclosed. This is the layer that keeps older, high-value bug classes usable after recency stops protecting them.
- Disclosure linting. Prompt variants are audited for leakage. At the class tier, a lint gate forbids advisory IDs, file paths, function names, and other handles that would let the model retrieve the upstream fix by name.
- Batch rotation. Each new batch refreshes the task set. Public release labels name what readers see on the website; unpublished task-cut labels stay internal.
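A disclosure lint of the kind described in the third layer could be sketched as follows. This is not CVEGym's gate: the regular expressions, the function name, and the per-task handle list are assumptions, and the advisory ID in the example is made up.

```python
import re
from typing import Iterable, List

# Advisory-ID shapes; the exact patterns are assumptions for this sketch.
ADVISORY_ID_PATTERNS = [
    re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    re.compile(r"\bGHSA(-[0-9a-z]{4}){3}\b", re.IGNORECASE),
]


def lint_class_prompt(prompt: str, forbidden_handles: Iterable[str]) -> List[str]:
    """Return a list of leakage violations; an empty list means the prompt passes."""
    violations: List[str] = []
    for pattern in ADVISORY_ID_PATTERNS:
        for match in pattern.finditer(prompt):
            violations.append(f"advisory id: {match.group(0)}")
    lowered = prompt.lower()
    # Per-task handles (file paths, function names) supplied by the task author.
    for handle in forbidden_handles:
        if handle.lower() in lowered:
            violations.append(f"task handle: {handle}")
    return violations


handles = ["src/fs/resolve.py", "normalize_path"]  # hypothetical task handles
clean = lint_class_prompt(
    "Fix a CWE-22 path-traversal vulnerability in this project.", handles
)
leaky = lint_class_prompt("Fix CVE-2099-00001 in normalize_path.", handles)
```

Naming the CWE class passes the lint, since the class is exactly what this tier is allowed to reveal; the second prompt is rejected for both the advisory ID and the function name.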
Contamination cannot be made impossible. It can be made visible, bounded, and more expensive to exploit than just solving the task, and that is the actual goal.
Evaluation integrity
Agent benchmarks need the same adversarial mindset as the systems they evaluate, so CVEGym treats the evaluator itself as attack surface.
- Agent-reported success is ignored. The scored artifact is the candidate patch. CVEGym applies the patch and recomputes the gates; a transcript saying "all tests pass" does not count.
- Managed runs only. DeepSource runs the sweeps. There is no public self-submission path where entrants can tune directly against the live grader.
- Private-task leak checks. Redaction happens before the client dataset is emitted. Static generation then scans the generated client bundle and prerendered task, leaderboard, and run pages for private CVE, GHSA, or RUSTSEC identifiers.
- Oracle provenance is retained. Trials carry oracle versions, oracle digests, and task checksums so methodology bugs can be traced to affected rows.
- Drift is explicit. The current class sweep records version metadata for all 49 tasks and reports zero task-version drift in that batch.
- Retractions stay visible. If a provider route changes, a grader defect is found, contamination evidence appears, or a run used a non-standard harness, the row is struck in place rather than silently deleted.
None of this makes the benchmark ungameable. It makes the attack surface explicit and auditable, so the published number never has to be taken on faith.
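The private-task leak check listed above can be illustrated with a simple scan over generated page text. This is a sketch, not the actual static-generation step: the function name, the in-memory page map, and the identifiers are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def find_identity_leaks(
    pages: Dict[str, str], private_ids: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (page path, identifier) pairs where a private id appears in the output."""
    ids = list(private_ids)
    leaks: List[Tuple[str, str]] = []
    for path, text in sorted(pages.items()):
        haystack = text.lower()
        for identifier in ids:
            if identifier.lower() in haystack:  # case-insensitive substring match
                leaks.append((path, identifier))
    return leaks


# Made-up identifiers and pages for illustration only.
private_ids = ["CVE-2099-00001", "RUSTSEC-2099-0001"]
pages = {
    "tasks/task-1a2b.html": "<h1>task-1a2b</h1><p>Advisory redacted.</p>",
    "leaderboard.html": "<td>rustsec-2099-0001</td>",
}
leaks = find_identity_leaks(pages, private_ids)
```

A build that wanted to enforce the check would fail whenever `leaks` is non-empty.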
Verification
DeepSource runs every evaluation. There is no self-submission path; managed runs are scheduled directly.
- Each model runs N trials per task, aggregated rather than best-of-N; the v1.0 board uses N = 3 across the 10-model, 49-task class sweep described above.
- Pass@1 is pooled over all attempts and shown with a 95% Wilson interval — aggregate behavior, never a cherry-picked best run. Task-solved counts are tracked separately: a model solves a task when at least one attempt passes.
- A trial is scoreable only when the verifier produced an oracle reward. Timeouts before verifier execution, cancelled runs, and transient API failures are process outcomes, not model failures.
- Harnesses are the ones described above (vendor-native CLIs for frontier models, the shared mini-swe-agent scaffold for open-weight models), held fixed per model for the whole sweep.
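The aggregation rules above can be sketched as one pass over trial records. The record shape and function name are assumptions: a trial is modeled as (model, task, passed), with `None` standing for a process outcome where the verifier never produced a reward.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (model, task, passed); passed is None when the trial is not scoreable.
Trial = Tuple[str, str, Optional[bool]]


def aggregate(trials: List[Trial]) -> Dict[str, dict]:
    """Pooled Pass@1, task-solved count, and process failures per model."""
    rows = defaultdict(
        lambda: {"scored": 0, "passes": 0, "process_failures": 0, "solved": set()}
    )
    for model, task, passed in trials:
        row = rows[model]
        if passed is None:
            row["process_failures"] += 1  # reported separately, not a model zero
            continue
        row["scored"] += 1
        if passed:
            row["passes"] += 1
            row["solved"].add(task)  # solved when at least one attempt passes
    return {
        model: {
            "pass_at_1": row["passes"] / row["scored"] if row["scored"] else None,
            "tasks_solved": len(row["solved"]),
            "process_failures": row["process_failures"],
        }
        for model, row in rows.items()
    }


# Hypothetical model with two tasks and three trials each.
summary = aggregate([
    ("model-x", "task-a", True), ("model-x", "task-a", False), ("model-x", "task-a", False),
    ("model-x", "task-b", False), ("model-x", "task-b", False), ("model-x", "task-b", None),
])
```

In this example the model solves one task from one passing attempt, yet its pooled Pass@1 is only 1 in 5, which is why the two numbers are tracked separately.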
The disclosure ladder is blind, area, class, behavior, subsystem, ordered by how much the prompt reveals. The same oracle grades every tier; only the prompt changes, so a future tier-to-tier curve can isolate how much localization a model needs to land the same fix. v1.0 reports the class tier: the model is told the bug class, but not the advisory, file, function, location, or fix. Broader disclosure sweeps follow.
- 0, blind: Only that a vulnerability of some kind exists. Harness-synthesized. “This project has a security bug. Find and fix it.”
- 1, area: The broad functional area, no class. Harness-synthesized. “…in how the project parses untrusted input.”
- 2, class: The CWE class only: no advisory ID, no file, no function. “…a CWE-22 path-traversal vulnerability.”
- 3, behavior: The class plus the observable misbehavior. “…lets a crafted path escape the intended root.”
- 4, subsystem: The class, behavior, and the region of code at fault. “…in the request-path normalization routine.”
The benchmark design supports all five tiers; v1.0 reports the class tier while broader disclosure sweeps follow. The oracle is identical across tiers, so a model's tier-to-tier curve measures how much localization it needs to land the same fix. Only class, behavior, and subsystem are author-written; the blind and area extremes are synthesized by the harness. The class tier is the contamination-sensitive one: a lint gate forbids advisory IDs, file paths, and function names there, so a task cannot be solved by retrieving the upstream fix by name alone.
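The ladder can be represented as an ordered enumeration. This is only a sketch: the tier names, their order, the harness-synthesized split, and the example fragments come from the text above, while the type and helper names are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Disclosure tiers, ordered by how much the prompt reveals."""
    BLIND = 0
    AREA = 1
    CLASS = 2
    BEHAVIOR = 3
    SUBSYSTEM = 4


# The two extremes are synthesized by the harness; the rest are author-written.
HARNESS_SYNTHESIZED = {Tier.BLIND, Tier.AREA}

# Example prompt fragments quoted from the ladder above.
EXAMPLE_FRAGMENT = {
    Tier.BLIND: "This project has a security bug. Find and fix it.",
    Tier.AREA: "…in how the project parses untrusted input.",
    Tier.CLASS: "…a CWE-22 path-traversal vulnerability.",
    Tier.BEHAVIOR: "…lets a crafted path escape the intended root.",
    Tier.SUBSYSTEM: "…in the request-path normalization routine.",
}


def is_author_written(tier: Tier) -> bool:
    return tier not in HARNESS_SYNTHESIZED


def reveals_more(a: Tier, b: Tier) -> bool:
    """True when tier a discloses more localization than tier b."""
    return a > b
```

Because the oracle is the same at every tier, only the prompt fragment would change between runs of the same task.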
Cost caps and retractions
Cost cap per trial. Caps are set on the CVEGym side and published in each batch's methodology footer. A trial that exhausts its cap counts as a failure in aggregation.
Model-level retractions. When a row is struck for a provider route swap, methodology bug, contamination evidence, non-standard harness, or provider API revocation, the leaderboard preserves the row at its original rank with the model name struck through and a retracted pill carrying the reason. Each retraction also gets a /news post citing the version, reason, and date. The oracle-level audit trail lives in the eval engine.
Design decisions
This design went through several rejected versions first. The pivots that mattered:
- Reward is a vector. The public board collapses it to Pass@1 and scalar, but the eval records every component signal. For RL and error analysis, dense feedback beats a single bit.
- Static-analyzer signals came out of the public score. Early versions included analyzer-derived features. Real sweep data showed they barely discriminated passing from failing patches and sometimes suppressed otherwise clear results. The public reward now stays on the security gate, the preservation gate, and structural quality.
- Tool-call efficiency came out too. It remains useful as process telemetry, but the public scalar should not reward a model for looking efficient while missing the security property.
- The corpus follows oracle feasibility. We narrowed to Python, Rust, and C/C++ when ecosystem-specific oracle work looked too expensive, then re-expanded to npm, Go, Java, and Ruby once the two-gate pattern proved portable.
Limitations
- Harness is bundled into the result. Each model runs in its representative harness rather than one shared scaffold, so a row reflects model-plus-harness, not the model in isolation. That mirrors how the model actually ships, but it means cross-model gaps can carry harness differences too. Decomposing the two — running every model under one harness — is future work.
- Attempt-level confidence intervals. The Wilson interval treats scored attempts as binomial observations. Attempts cluster by model and task, so the interval is a compact leaderboard summary rather than a full hierarchical uncertainty model.
- Open-source only. Everything runs on public OSS, which bounds the vulnerability distribution to what shows up in public advisories.
- One disclosure axis. The current design varies prompt localization, not runtime evidence. A model that benefits from a stack trace more than from a prose hint is not isolated by v1.0.
- Trajectories are published only for exemplar tasks. The verifier records enough audit material for internal debugging and retractions. Full agent trajectories are released for the public exemplar tasks; for identity-withheld tasks they stay internal, since a trajectory would leak the repo, advisory, and fix the redaction is protecting.
Roadmap
The hand-authored v1.0 panel is deep but small. Each task costs hours of oracle, adversary, and prompt work. The next step is a second lane: a lean OSS-Fuzz variant with a sanitizer-replay oracle, auto-rendered from public reproducers, and scaled by automation rather than hand curation alone.
- native (43): The bug as it shipped upstream, fixed in its original repo. Defended by recency and identity withholding. 43 tasks on the v1.0 board.
- ported (6): A real bug shape transplanted into an unrelated repo, so memorizing the original advisory does not transfer. 6 tasks.
- lean · OSS-Fuzz (~1,000 target): Sanitizer-replay oracle, auto-rendered from OSS-Fuzz reproducers. Breadth over depth, on a separate leaderboard.
- sanitizer co-signal: ASan/UBSan banner as an observability signal alongside the behavioral oracle.
- reproducer replay: Run the OSS-Fuzz testcase against the patched tree. This is the lean oracle.
- lean auto-render: Template new OSS-Fuzz reports into tasks automatically where the reproducer is replayable.
- bare vs coached: A/B the agent harness: does mid-episode nudging change who fixes what?
The v1.0 panel is hand-authored depth: 49 tasks split between 43 native repairs and 6 ported vulnerability shapes. The lean OSS-Fuzz variant is the breadth play: a sanitizer-replay oracle auto-rendered from public reproducers, trading the disclosure ladder and the functional-preservation gate for scale. The two leaderboards stay separate; the curated panel is the defender-fidelity claim, lean is the coverage claim. The dashed bar in the figure is a target, not a shipped count.
The lean lane stays separate from the curated panel, and it is more reward-hackable by construction: it drops the disclosure ladder, the functional-preservation gate, and the adversary work that make the curated panel expensive. Its job is coverage. The defender-fidelity claim stays with the curated panel.