Why SWE-bench Verified now not measures frontier coding capabilities

Since we first revealed SWE-bench Verified in August 2024, the trade has extensively used it to measure the progress of fashions on autonomous software program engineering duties. After its launch, SWE-bench Verified offered a powerful sign of functionality progress and have become a regular metric reported in frontier mannequin releases. Monitoring and forecasting progress of those capabilities can also be an necessary a part of OpenAI’s Preparedness Framework. Once we created the Verified benchmark initially, we tried to unravel points within the unique analysis that made sure duties inconceivable to perform within the SWE-bench dataset⁠(opens in a brand new window).

After preliminary leaps, state-of-the-art progress on SWE-bench Verified has slowed, bettering⁠(opens in a brand new window) from 74.9% to 80.9% within the final 6 months. This raises the query: do the remaining failures replicate mannequin limitations or properties of the dataset itself?

In a brand new evaluation, we discovered two main points with the Verified set that point out the benchmark is now not appropriate for measuring progress on autonomous software program engineering capabilities for frontier launches at in the present day’s efficiency ranges:

Exams reject right options: We audited a 27.6% subset of the dataset that fashions typically failed to unravel and located that at the very least 59.4% of the audited issues have flawed check circumstances that reject functionally right submissions, regardless of our greatest efforts in bettering on this within the preliminary creation of SWE-bench Verified.
Coaching on options: As a result of massive frontier fashions can be taught data from their coaching, it will be important that they’re by no means skilled on issues and options they’re evaluated on. That is akin to sharing issues and options for an upcoming check with college students earlier than the check – they might not memorize the reply however college students who’ve seen the solutions earlier than will definitely do higher than these with out. SWE-bench issues are sourced from open-source repositories many mannequin suppliers use for coaching functions. In our evaluation we discovered that each one frontier fashions we examined had been capable of reproduce the unique, human-written bug repair used because the ground-truth reference, generally known as the gold patch, or verbatim downside assertion specifics for sure duties, indicating that each one of them have seen at the very least a number of the issues and options throughout coaching.

We additionally discovered proof that fashions which have seen the issues throughout coaching usually tend to succeed, as a result of they’ve further data wanted to move the underspecified checks.

Because of this enhancements on SWE-bench Verified now not replicate significant enhancements in fashions’ real-world software program improvement talents. As a substitute, they more and more replicate how a lot the mannequin was uncovered to the benchmark at coaching time. For this reason now we have stopped reporting SWE-bench Verified scores, and we advocate that different mannequin builders achieve this too.

We’re constructing new, uncontaminated evaluations to raised monitor coding capabilities, and we expect this is a vital space to deal with for the broader analysis neighborhood. Till now we have these, OpenAI recommends reporting outcomes for SWE-bench Professional.

The unique SWE-bench⁠(opens in a brand new window) analysis was launched in 2023. Every downside is sourced from a resolved GitHub difficulty in one in all 12 open-source Python repositories and paired with the corresponding pull request (PR). To find out whether or not a model-generated code change is right, every downside comes with two units of checks:

Exams that fail on the unmodified codebase however move if the problem is appropriately fastened
Regression checks that move each earlier than and after the repair to make sure unrelated performance stays intact.

The mannequin doesn’t see the checks. It has to supply a code change given solely the unique difficulty textual content and the state of the repository earlier than the repair. It passes an issue provided that all checks move after the code change is utilized.

We discovered many points with that analysis that would result in underreporting the potential of fashions.

Some unit checks had been overly particular or misaligned with the duty so right fixes may very well be rejected.
Many activity statements had been underspecified, which might result in a number of legitimate interpretations – whereas the checks solely coated a particular one.
Relying on setup of the atmosphere (for instance Linux vs Home windows, or the python model), some checks might spuriously fail

We created SWE-bench Verified in 2024 to deal with these points. We labored with skilled software program engineers to overview 1,699 SWE-bench issues and filter out issues that had these points. Every downside was reviewed by three consultants independently. This overview course of resulted in SWE-bench Verified, a curated set of 500 issues.

Too slim and too vast checks

Whereas SWE-bench Verified is a giant enchancment over the preliminary model, residual points stay. We performed an audit of 138 SWE-bench Verified issues that OpenAI o3 didn’t persistently clear up over 64 unbiased runs. Every case was independently reviewed by at the very least six skilled software program engineers. If an skilled flagged a problem, it was re-verified by an extra group.

We discovered that 59.4% of the 138 issues contained materials points in check design and/or downside description, rendering them extraordinarily troublesome or inconceivable even for essentially the most succesful mannequin or human to unravel.

35.5% of the audited duties have strict check circumstances that implement particular implementation particulars, invalidating many functionally right submissions, which we name slim check circumstances.
18.8% of the audited duties have checks that examine for extra performance that wasn’t laid out in the issue description, which we name vast check circumstances.
The remaining 5.1% of duties had miscellaneous points that weren’t effectively grouped with this taxonomy.

An illustrative instance of the primary failure mode is pylint-dev__pylint-4551⁠(opens in a brand new window), the place the PR introduces a brand new operate `get_annotation` as a part of the general resolution. This operate identify isn’t talked about in the issue description, however is imported straight by the checks. Whereas some fashions may intuit to create such a operate, it’s not strictly essential to implement a operate with this particular identify to appropriately handle the issue. Many legitimate options fail the checks on import errors.

PR check failures (truncated for readability)

Unique PR description (from the GitHub PR)

Drawback Description for #18212

Drawback Description for SWE-bench Verified activity (solely taken from #18212):

SWE-bench Verified and the repositories (code bases and launch notes) are each open-source and broadly used and mentioned, which makes avoiding contamination troublesome for mannequin builders.

We first encountered indicators of contamination in our personal fashions. For instance, when GPT‑5.2 solved 31 duties we recognized to be nearly inconceivable to unravel. In django__django-14725⁠(opens in a brand new window) the checks require a particular new parameter `edit_only` which isn’t explicitly required by the issue assertion. Whereas fixing the issue, GPT‑5.2 exhibits in its chain of thought that it has details about the discharge notes that element adjustments to the codebase, and appropriately identifies that the `edit_only` parameter was launched in Django 4.1.

To evaluate how vital contamination is extra broadly, we created an automatic red-teaming setup. For every SWE-bench Verified query, we tasked GPT‑5 with probing a GPT‑5.2‑Chat, Claude Opus 4.5 and Gemini 3 Flash Preview for contamination. These fashions had been chosen to exclude reasoning fashions, however we acknowledge there’s seemingly a non-trivial functionality hole between them.

To probe for contamination, GPT‑5 obtained: the SWE-bench Verified activity’s ID, description, gold patch, and PR checks. Over 15 turns, we allowed GPT‑5 to differ the system/developer immediate, consumer immediate, and assistant prefill and completely different elicitation methods. After every flip, a choose mannequin labeled how a lot novel task-specific data appeared and every response was labeled for contamination severity from “none” to “robust.” GPT‑5 was allowed to adapt its technique primarily based on prior turns to iteratively get well task-specific particulars. For every instance of robust contamination, we verified with one other choose that GPT‑5 didn’t leak an excessive amount of data to the goal mannequin. Lastly, we then manually reviewed the “robust” examples that make up the transcripts on this put up.

Under are examples of robust contamination throughout completely different mannequin suppliers.

Given a brief snippet from the duty description, GPT‑5.2 outputs the precise gold patch. Particularly, it is aware of the precise class and technique identify, and the brand new early return situation `if username is None or password is None` that’s launched.

Contamination elicitation

Opus is ready to not solely recall the precise 4-line purposeful change the PR launched, together with the precise filename and technique that it touched, but additionally quotes verbatim the inline remark that was a part of the diff.

Contamination elicitation

Gemini 3 Flash, when given no additional data concerning the duty moreover the ID, is ready to output verbatim particulars from the duty description and the gold patch. This contains the brand new regex components for username validation and the precise line numbers for the change.

Contamination elicitation

From this audit of SWE-bench Verified, we see two broader classes for analysis design. First, benchmarks sourced from publicly obtainable materials carry contamination threat, the place training-data publicity can silently inflate scores. If publicly crawled information is utilized in benchmark building, mannequin builders ought to carry out further checks for contamination. Benchmarks, and even their options, posted publicly can find yourself in coaching information. Additional care must be taken each in how datasets are posted (i.e. password protected) and coaching information filtering (i.e. strict adherence to canary strings).

Second, automated scoring is hard to get proper; good check circumstances ought to totally confirm right performance, being each agnostic to particular unimportant implementation particulars and likewise strong to shortcut options. These issues are inherently complicated and troublesome to unravel. Catching these issues took a number of intensive human labeling campaigns.

We’ve got included these findings into our latest analysis efforts. Within the final months we’ve chosen to report outcomes from the general public break up of SWE-Bench Professional. We advocate different mannequin builders do the identical. SWE-bench Professional isn’t good, however empirically appears to undergo much less from contamination points. Our contamination pipeline discovered some circumstances of contamination, however these circumstances had been considerably rarer and fewer egregious than SWE-bench Verified, and no mannequin was capable of produce an entire verbatim gold patch.

We are going to proceed to put money into unique, privately authored benchmarks and ask for assist from the trade and academia to do the identical. In GDPVal⁠, duties are privately authored by area consultants, lowering publicity threat, and options are graded holistically by skilled reviewers. This method is resource-intensive, however more and more essential to measure real functionality enhancements.

Source link

Article Tags:

Article Categories:

Water Purifiers & Accessories

Why SWE-bench Verified now not measures frontier coding capabilities

Too slim and too vast checks

PR check failures (truncated for readability)

Unique PR description (from the GitHub PR)

Drawback Description for #18212

Drawback Description for SWE-bench Verified activity (solely taken from #18212):

Contamination elicitation

Contamination elicitation

Contamination elicitation

Leave a Reply Cancel reply

Bitcoin Mining Issue Falls Barely in Newest Adjustment

Aave’s TVL Falls $8B After $293M Kelp DAO Hack