Why is independently verifying AI benchmark scores so difficult?

AI labs publish scores but rarely publish what's needed to reproduce them — the tools provided, agent harness design, and exact prompt format all affect outcomes significantly.

When a lab publishes a benchmark score, they typically disclose the dataset name, the metric, and the number. They rarely disclose the full evaluation setup — the specific tools available to the model, the agent scaffolding that orchestrates the model's actions, the exact system prompt, and the retry logic. On coding benchmarks like SWE Verified, where an AI agent must autonomously fix real software bugs, these setup choices aren't minor. They're the benchmark.

An agent that receives well-matched tools from its training distribution performs differently than the same model running with generic tools or no tools at all. A harness that formats messages in the model's native training format produces different outcomes than one that routes through a generic API endpoint. These aren't bugs — they're properties of how large language models work. But they make independent reproduction very hard when the original evaluation doesn't document them.

The result is a verification gap. Labs publish numbers. Other researchers can't reproduce them without reverse engineering the setup. Most don't try. This means AI benchmark scores circulate widely, get cited in product comparisons and deployment decisions, but rest on setups that no one outside the lab has confirmed. That's the gap Borislav Mavrin set out to close for gpt-oss-20b.

What is gpt-oss-20b and what scores did OpenAI claim for it?

gpt-oss-20b is an open-weights model that OpenAI released alongside benchmark comparisons — but without publishing the tools or agent harness used in those evaluations.

OpenAI published the following benchmark scores for gpt-oss-20b with tools: 60.7% on SWE Verified HIGH, 53.2% on SWE Verified MEDIUM, and 90.4% on AIME25 with tools. SWE Verified is a benchmark where an agent receives a real GitHub issue from a curated set of software repositories and must autonomously produce a code patch that fixes the bug and passes all tests. HIGH and MEDIUM refer to the model's reasoning-effort settings during evaluation, not to difficulty tiers in the dataset. AIME25 is the 2025 American Invitational Mathematics Examination.

These scores were notable. A 60.7% on SWE Verified HIGH would place gpt-oss-20b among the best-performing models on that benchmark. The scores fed comparisons, research discussions, and deployment evaluations across the community. But they came with a catch: OpenAI's original paper disclosed neither the tools provided to the model nor the agent harness structure used during evaluation. Without those two things, no outside researcher could confirm the numbers.

How did Mavrin actually reproduce the results?

He reverse-engineered the model's in-distribution tool behavior and built a native agent harness — called Harmony — that bypasses the lossy conversion that standard API wrappers introduce.

Mavrin's first challenge was the tool problem. Without knowing which tools OpenAI gave gpt-oss-20b during evaluation, reproducing the setup seemed impossible. His solution: probe the model itself. When prompted without any tool definitions, gpt-oss-20b still calls tools — it uses names and formats from its training distribution "with high statistical confidence," in the paper's phrasing. This isn't hallucination in the standard sense. It's a model calling tools it was trained to expect because the prompt structure signals that it's in a tool-use context. By identifying which tools the model consistently invoked, Mavrin could infer the approximate tool set OpenAI used during evaluation.
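The probing idea can be sketched in a few lines: repeatedly sample the model with an agent-style prompt but no tool definitions, then tally which tool names it invokes anyway. Everything below (the `generate` stub, the toy `call:` syntax, the parser) is hypothetical illustration, not the paper's actual code.

```python
# Sketch of probing a model for in-distribution tool calls.
# Assumptions: `generate` is any callable prompt -> completion text, and
# tool invocations in the output match a toy "call:<name>" pattern.
from collections import Counter
import re

def extract_tool_calls(completion: str) -> list[str]:
    # Toy parser: find tokens shaped like a tool invocation,
    # e.g. "call:python" or "call:browser.search".
    return re.findall(r"call:([\w.]+)", completion)

def probe_tool_distribution(generate, prompt: str, n_samples: int = 100) -> Counter:
    """Tally tool names the model invokes even though no tools were declared."""
    counts: Counter = Counter()
    for _ in range(n_samples):
        counts.update(extract_tool_calls(generate(prompt)))
    return counts

# With a stubbed model that always "calls" python, the tally is deterministic:
fake_model = lambda prompt: "Let me check. call:python\nprint(2 + 2)"
print(probe_tool_distribution(fake_model, "Fix this bug.", n_samples=10))
```

Tool names that recur across many samples, with no tool context in the prompt, are candidates for the training-distribution tool set.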

This in-distribution tool calling behavior is the paper's most technically significant finding, separate from the reproduction itself. It means that models trained on agent data can exhibit deterministic-looking tool use without any tool context in the prompt — a property that has real implications for benchmark design and for anyone trying to understand a model's behaviour in deployment.

The second challenge was the harness. The standard path for running gpt-oss-20b is through OpenAI's Chat Completions API, which converts messages into a unified format. Mavrin identified that this conversion is lossy — it loses formatting information that affects model behaviour. He built Harmony, a custom agent harness that encodes messages in the model's native training format, bypassing the API conversion layer. The harness, which is open-source at github.com/borislavmavrin/harmonyagent, allows other researchers to run the same evaluation setup.
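The difference between the two paths can be sketched: a generic chat API re-serializes messages into its own schema, while a native harness renders them directly in the model's training markup. The rendering below follows the general shape of gpt-oss's published Harmony format, but it is an illustrative sketch under that assumption, not the harmonyagent implementation.

```python
# Minimal sketch of native-format rendering: encode a conversation as one
# prompt string using Harmony-style special tokens, rather than sending a
# list of role/content dicts through a generic chat-completions layer.
def render_harmony(messages: list[dict]) -> str:
    """Encode a conversation in Harmony-style markup and cue the next turn."""
    parts = []
    for msg in messages:
        parts.append(f"<|start|>{msg['role']}<|message|>{msg['content']}<|end|>")
    # Cue the model to produce the next assistant turn.
    parts.append("<|start|>assistant")
    return "".join(parts)

prompt = render_harmony([
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Fix the failing test in utils.py."},
])
print(prompt)
```

Because the harness controls the exact byte sequence the model sees, nothing is lost to an intermediate message schema; that is the property a generic API wrapper cannot guarantee.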

With tools identified and the native harness built, Mavrin ran the evaluations. The results came back within 1.3 percentage points of OpenAI's published scores, and within a fraction of a point on both SWE Verified splits.

| Benchmark | OpenAI Published | Mavrin Reproduced | Difference |
| --- | --- | --- | --- |
| SWE Verified HIGH | 60.7% | 60.4% | −0.3 pp |
| SWE Verified MEDIUM | 53.2% | 53.3% | +0.1 pp |
| AIME25 with tools | 90.4% | 91.7% | +1.3 pp |

Source: arXiv 2604.00362 — "In harmony with gpt-oss" by Borislav Mavrin, April 1, 2026.

What do the matched scores mean — and what would mismatches have meant?

Matched scores within this margin mean OpenAI's benchmark methodology for gpt-oss-20b was solid — the numbers represent what the model can actually do under approximately those conditions.

The SWE Verified HIGH difference is 0.3 percentage points. The MEDIUM difference is 0.1 percentage points. The AIME25 result actually comes in higher than OpenAI published — 91.7% versus 90.4%. None of these is a meaningful deviation. Benchmark results vary run to run due to sampling temperature, test ordering, and minor infrastructure differences; a 0.3 pp gap is within expected replication noise.
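A back-of-the-envelope calculation makes the noise floor concrete. Treating each benchmark instance as an independent pass/fail trial, the standard error of an observed pass rate is sqrt(p(1-p)/n). Assuming a split of roughly 500 instances (the size of the full SWE-bench Verified set; the HIGH split may differ) and a true pass rate near 60%:

```python
# Back-of-the-envelope replication noise for a pass-rate benchmark.
# Illustrative inputs: n = 500 instances, true pass rate ~60.7%.
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """Standard error of an observed pass rate under a binomial model."""
    return math.sqrt(p * (1 - p) / n)

se = pass_rate_stderr(0.607, 500)
print(f"standard error: {se * 100:.1f} pp")  # prints "standard error: 2.2 pp"
```

Under these assumptions, a 0.3 pp gap between two runs sits comfortably within one standard error.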

What would a mismatch have meant? That depends on direction and magnitude. A reproduced score substantially lower than the published number — say, 40% instead of 60% — would suggest cherry-picking, favourable test conditions, or a harness specifically optimised to inflate performance. A higher reproduced score would suggest the lab published conservative numbers. Neither happened here. The reproduction landed squarely on the claimed values, which validates the claim.

This matters for trust, not just methodology. AI benchmarks are the primary mechanism through which researchers, buyers, and policymakers compare models. If the scores don't hold up under independent reproduction, the decisions made on the basis of those scores are made on bad information. The fact that these scores held up is a genuine positive signal — at least for this model, on these benchmarks, OpenAI's published numbers are real.

What does this tell us about AI accountability and reproducibility?

It shows independent verification is possible — but it took an outside researcher painstaking reverse engineering to do what labs should be enabling routinely through standardised disclosure of evaluation setups.

Mavrin's paper describes this as "the first independent reproduction" of OpenAI's published scores for any production-accessible model. That framing says something important about where the field is. The machinery exists — benchmarks, models, compute. What's missing is the norm. Labs publish scores without publishing the full reproduction package: no tool specifications, no harness code, no prompt templates, no sampling parameters. This makes verification possible in theory and impractical in fact.

The academic analogy is direct: journals require methodology sections precisely because "I ran the experiment and got this number" isn't science — the number needs to be reproducible by someone else. AI benchmark claims function as scientific findings when they drive research, product decisions, and policy. The standard for what gets disclosed should match that weight.

There's also a structural incentive problem. Labs are competing on benchmarks. Publishing everything needed to reproduce your scores gives competitors a roadmap. In the short term, opacity protects advantage. In the long term, it erodes the credibility of the entire benchmark ecosystem — including the scores that favour the same labs that resist disclosure.

Nexairi Analysis: If It Can Be Done, It Should Be Required

The Mavrin paper's most important finding isn't the reproduction gap. It's that the reproduction was possible at all, given how little OpenAI disclosed. One researcher, working independently, figured out the tools and the harness and got the right answer. That's a testament to Mavrin's technical skill — but it also reveals how thin the barrier to transparency actually is.

If a single researcher can reproduce scores through reverse engineering in weeks, a lab can document them in an afternoon. Publishing a harness template and a tool schema is routine documentation work, not a competitive moat. The reason it doesn't happen isn't cost. The reason is that the incentive structure doesn't require it.

National AI safety bodies — the UK AI Safety Institute and the US AI Safety Institute — are beginning to build independent evaluation capacity. But they can't evaluate at the pace of model releases, and their access depends on cooperation from the labs. Structural benchmarking transparency — required disclosure of evaluation setup as a condition of publishing benchmark claims — would do more to stabilise the information environment than any number of one-off reproduction studies. The Mavrin paper shows the goal is achievable. It also shows it shouldn't require this much effort.
