Key Takeaways
According to commonly used benchmarks, frontier large language models (LLMs) have now surpassed the average human’s ability to solve mathematical problems and perform complex reasoning.
For instance, OpenAI’s o1 model recently outperformed human experts on PhD-level science questions.
However, a group of Apple researchers (Mirzadeh et al.) has recently highlighted a major flaw in the way AI performance is assessed.
When the researchers changed the phrasing of benchmark questions only slightly, leading models from OpenAI, Google, Anthropic, and Meta saw their ability to answer correctly collapse.
Standardized AI benchmarks make it possible to compare different models’ performance. However, if AI developers measure intelligence using only a limited set of benchmarks, they risk creating models that perform exceedingly well on a finite set of predetermined tasks but flounder in the wild.
To explore the issue, Mirzadeh et al. modified the commonly used GSM8K benchmark, a set of 8,500 grade-school math word problems.
The researchers found that even superficial changes such as switching names negatively impacted model performance.
When they changed the numerical values, performance dropped more notably. The most significant decrease occurred when they altered the questions themselves: adding a single clause that appeared relevant but had no bearing on the answer caused performance to decline by up to 65%.
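To make the idea concrete, the sketch below shows one way such perturbations could be generated programmatically, in the spirit of the templated variants the researchers describe: the name and the numbers are swapped, and an inconsequential clause is optionally appended, while the correct answer is unchanged. The template, names, and helper function are illustrative assumptions, not the researchers' actual code.

```python
# Illustrative sketch (not the researchers' code): perturbing a GSM8K-style
# question by swapping the name, changing the numbers, and appending an
# irrelevant clause, while the arithmetic stays the same.
import random

TEMPLATE = (
    "{name} picks {n1} apples on Monday and {n2} apples on Tuesday. "
    "{noop}How many apples does {name} have in total?"
)

NAMES = ["Sophie", "Liam", "Aisha", "Mateo"]           # name swap
NOOP_CLAUSES = [                                        # irrelevant detail
    "Five of the apples are slightly smaller than the rest. ",
    "",                                                 # no extra clause
]

def make_variant(seed: int) -> tuple[str, int]:
    """Return a perturbed question and its correct answer."""
    rng = random.Random(seed)
    n1, n2 = rng.randint(2, 20), rng.randint(2, 20)     # value change
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        n1=n1,
        n2=n2,
        noop=rng.choice(NOOP_CLAUSES),
    )
    return question, n1 + n2                            # answer is unaffected

if __name__ == "__main__":
    for i in range(3):
        q, a = make_variant(i)
        print(q, "->", a)
```

The key point is that none of these changes affects the underlying arithmetic, so a model that genuinely reasons about the problem should answer the original and the variant equally well.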
Interestingly, the researchers observed this “fragility of mathematical reasoning” across all models they tested, including so-called chain-of-thought (CoT) models like OpenAI’s o1 that are meant to be capable of complex reasoning.
Chain-of-thought first emerged as a form of prompt engineering that breaks down complex prompts into a series of intermediate steps.
Although the technique was honed as an additional stage developers could apply to LLM prompts, some models now incorporate CoT into their architecture.
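As a rough illustration of the prompt-engineering form of CoT, the sketch below contrasts a direct prompt with one that asks the model to spell out its intermediate steps before answering. The question and the exact prompt wording are invented for the example.

```python
# Illustrative sketch of chain-of-thought as prompt engineering: the same
# question asked directly versus with an instruction to reason step by step.
question = (
    "A train travels 60 km in the first hour and 45 km in the second hour. "
    "How far does it travel in total?"
)

direct_prompt = f"Question: {question}\nAnswer:"

cot_prompt = (
    f"Question: {question}\n"
    "Let's think step by step, writing out each intermediate calculation "
    "before giving the final answer.\n"
    "Answer:"
)

# Either string would be sent to an LLM completion API; the CoT version
# nudges the model to produce intermediate steps (60 + 45 = 105) rather
# than jumping straight to a final number.
print(direct_prompt)
print("---")
print(cot_prompt)
```

In models like o1, a comparable step-by-step process happens internally rather than being spelled out in the user's prompt.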
With CoT baked in, OpenAI’s o1 is much more capable of complex reasoning than its predecessors. The model’s lead developer, Lukasz Kaiser, has argued that the new design approach represents a shift for LLMs that will lead to more concrete logical processes.
Yet, for all its apparent advancements, o1 was subject to the same fragile reasoning the Apple researchers observed in other models.
Despite the models’ major performance gains, the researchers concluded that even the most sophisticated LLM operations “resemble sophisticated pattern matching more than true logical reasoning”.
Nevertheless, their findings do suggest that CoT-based approaches are moving in the right direction.
Of all the models assessed, o1 experienced the smallest performance decline between the regular GSM8K questions and the modified ones. In other words, although its reasoning was found to be fragile, it is less fragile than that of the other models.
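For clarity, “performance decline” here is simply the gap between a model’s accuracy on the original GSM8K questions and its accuracy on the modified variants. The figures below are hypothetical and only illustrate the calculation, not the paper’s actual results.

```python
# Illustrative arithmetic (hypothetical accuracies, not the paper's figures):
# the "performance decline" is the drop from a model's accuracy on the
# original GSM8K questions to its accuracy on the modified variants.
results = {
    "model_a": {"original": 0.95, "modified": 0.90},
    "model_b": {"original": 0.88, "modified": 0.60},
}

for name, acc in results.items():
    drop = acc["original"] - acc["modified"]
    relative_drop = drop / acc["original"]
    print(f"{name}: absolute drop {drop:.2f}, relative drop {relative_drop:.0%}")
```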