
Apple Researchers Suggest ‘Fragile’ AI Reasoning Capabilities Are Overstated

By James Morales

Key Takeaways

  • Apple researchers have argued that performance benchmarks may overstate AI reasoning capabilities.
  • In an experiment, they found that even minor changes to benchmark questions resulted in significantly worse performance.
  • The researchers concluded that AI models rely on “sophisticated pattern matching more than true logical reasoning.”

According to commonly used benchmarks, frontier large language models (LLMs) have now surpassed the average human’s ability to solve mathematical problems and perform complex reasoning.

For instance, OpenAI’s o1 model recently outperformed human experts on PhD-level science questions.

However, a group of Apple researchers (Mirzadeh et al.) has recently highlighted a major flaw in the way AI performance is assessed.

When the researchers changed the phrasing of the questions even slightly, leading models from OpenAI, Google, Anthropic, and Meta saw their ability to answer correctly collapse.

The Limitations of AI Benchmarks

Standardized AI benchmarks make it possible to compare different models’ performance. However, if AI developers only measure intelligence using a limited set of benchmarks, they risk creating models that perform exceedingly well on a finite set of predetermined tasks but flounder in the wild.

To explore the issue, Mirzadeh et al. modified the commonly used GSM8K benchmark, a set of 8,500 grade-school math word problems.

The researchers found that even superficial changes, such as switching the names used in a question, negatively impacted model performance.

When they changed the values, performance dropped more notably. The most significant decrease occurred when they rephrased the question entirely. For example, adding a single irrelevant clause caused performance to decline by up to 65%.
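To make those perturbations concrete, here is a minimal Python sketch of the three kinds of change the researchers describe. The word problem and names are invented for illustration; they are not taken from GSM8K or the Apple paper.

```python
# Illustrative sketch (invented example, not a question from the paper):
# the three kinds of perturbation described above, applied to a
# GSM8K-style word problem.

TEMPLATE = (
    "{name} picks {n} apples on Friday and twice as many on Saturday. "
    "{clause}How many apples does {name} pick in total?"
)

# Original-style question.
original = TEMPLATE.format(name="Sophie", n=4, clause="")

# 1. Superficial change: swap the name only.
name_swapped = TEMPLATE.format(name="Liam", n=4, clause="")

# 2. Change the numerical values.
value_changed = TEMPLATE.format(name="Sophie", n=7, clause="")

# 3. Add a single irrelevant clause, the kind of change that produced
#    the largest drops in the study.
with_noise = TEMPLATE.format(
    name="Sophie",
    n=4,
    clause="Five of the apples are slightly smaller than average. ",
)

for question in (original, name_swapped, value_changed, with_noise):
    print(question)
```

None of these edits changes the arithmetic needed to solve the problem, which is why a sharp drop in accuracy points to pattern matching rather than genuine reasoning.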

Interestingly, the researchers observed this “fragility of mathematical reasoning” across all models they tested, including so-called chain-of-thought (CoT) models like OpenAI’s o1 that are meant to be capable of complex reasoning.

The Rise of Chain-of-Thought

Chain-of-thought first emerged as a form of prompt engineering that breaks down complex prompts into a series of intermediate steps.
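As a rough illustration of the technique (the question and prompt wording below are invented, not drawn from OpenAI or the Apple paper), a chain-of-thought prompt asks the model to spell out intermediate steps before giving a final answer:

```python
# Minimal sketch of chain-of-thought prompting (illustrative wording only,
# not tied to any specific model or API).

question = (
    "A cinema ticket costs $12 and a snack costs $5. "
    "How much do 3 tickets and 2 snacks cost?"
)

# Direct prompt: ask for the answer outright.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompt: ask the model to lay out intermediate steps
# before committing to a final answer.
cot_prompt = (
    f"{question}\n"
    "Let's think step by step:\n"
    "1. Work out the total cost of the tickets.\n"
    "2. Work out the total cost of the snacks.\n"
    "3. Add the two amounts and state the final answer."
)

print(direct_prompt)
print()
print(cot_prompt)
```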

Although the technique was honed as an additional stage developers could apply to LLM prompts, some models now incorporate CoT into their architecture.

With CoT baked in, OpenAI’s o1 is much more capable of complex reasoning than its predecessors. The model’s lead developer Lukasz Kaiser has argued that the new design approach represents a shift for LLMs that will lead to more concrete logical processes. 

Yet, for all its apparent advancements, o1 was subject to the same fragile reasoning the Apple researchers observed in other models.

AI Still Incapable of Formal Reasoning

Despite major performance gains, the researchers concluded that even the most sophisticated LLM operations “resemble sophisticated pattern matching more than true logical reasoning”.

Nevertheless, their findings do suggest that CoT-based approaches are moving in the right direction. 

Of all the models assessed, o1 experienced the smallest performance decline between the standard GSM8K questions and the modified ones. In other words, although its reasoning was found to be fragile, it is less fragile than that of the other models.


James Morales

Although his background is in crypto and FinTech news, these days, James likes to roam across CCN’s editorial breadth, focusing mostly on digital technology. Having always been fascinated by the latest innovations, he uses his platform as a journalist to explore how new technologies work, why they matter and how they might shape our future.