OpenAI Accused of Manipulating Benchmark Results as Chinese Models Close AI Performance Gap

By James Morales
Edited by Insha Zia

Key Takeaways

  • It was recently revealed that OpenAI secretly funded and accessed data related to the FrontierMath AI benchmark.
  • The controversy raises questions about the legitimacy of OpenAI’s performance claims.
  • Emerging Chinese models are increasingly threatening OpenAI’s supremacy.

In the ChatGPT era, OpenAI has consistently topped AI leaderboards, with each new model raising the bar in areas such as mathematical reasoning, coding, and long-context understanding.

However, rivals are closing the gap, especially new models emerging from China that have clocked some impressive benchmark scores.

Amid rising competition, OpenAI has come under fire for secretly funding and accessing a benchmarking dataset, raising questions over the legitimacy of its performance claims.

The FrontierMath Controversy

The benchmark at the heart of the controversy is FrontierMath—a set of hundreds of difficult math problems specifically designed to push the limits of what is possible with AI.

FrontierMath was created as a way to challenge frontier AI models that already achieve near-perfect scores on easier tests.

OpenAI commissioned the new benchmark. However, researchers, including mathematicians who contributed to the project, were kept in the dark about the company’s involvement until recently.

Worse still, OpenAI had access to all but 50 of the problems in the dataset, a detail the company left out when it boasted about its latest model’s 25% score on FrontierMath.

Rigged Benchmark Results?

EpochAI, the research institute that developed the benchmark, has sought to pass the blame on to OpenAI, highlighting clauses in the contract that prevented full disclosure.

Moreover, Epoch director Tamay Besiroglu insisted that OpenAI made a “verbal agreement” not to use FrontierMath problems for AI training.

However, given that OpenAI’s o3 reportedly scored more than ten times higher than other leading models, many are now suspicious of those results.

Crucially, the furor over FrontierMath comes at a time when OpenAI’s models no longer dominate AI rankings as they once did.

Ascendant Chinese AI

Alongside traditional rivals like Google and Meta, a crop of upstart Chinese players is now snapping at OpenAI’s heels.

On Monday, Jan. 20, Hangzhou-based DeepSeek released a new open-source model, R1, that it claims beats OpenAI’s o1 on the AIME, MATH-500, and SWE-bench benchmarks.

Then, on Wednesday, ByteDance unveiled Doubao 1.5 Pro, a new language model that delivers GPT-4o-level performance with just a fraction of the computing power.

Gone are the days when Chinese AI models were little more than cheap knockoffs of their American counterparts.

Doubao 1.5 Pro and DeepSeek R1 prove that OpenAI’s technological supremacy is far from secured. And if the company’s AI crown is usurped, it may not be an American rival that does it.

James Morales

Although his background is in crypto and FinTech news, these days, James likes to roam across CCN’s editorial breadth, focusing mostly on digital technology. Having always been fascinated by the latest innovations, he uses his platform as a journalist to explore how new technologies work, why they matter and how they might shape our future.