
OpenAI’s o1 Model Can Solve Complex Science Problems but Struggles With Basic Tasks

By James Morales
Key Takeaways

  • OpenAI has launched its latest AI model, o1.
  • The new model performs significantly better than previous generations on complex reasoning.
  • However, it still struggles with some basic reasoning and lateral-thinking challenges.

Launched on Thursday, Sept. 12, OpenAI’s latest model, o1, excels at complex tasks, particularly in science and mathematics.

However, some user reviews say the new model often fails at basic reasoning challenges. 

O1 Fails To Crack the Code

One example of o1 failing at exactly the kind of reasoning it is meant to excel at has circulated online in the form of a logic challenge.

In an example shared on X, the correct answer to the problem is 3-8-4-1, yet o1 confidently asserted to the user that 3-2-1-8 meets the criteria. When CCN repeated the experiment, the chatbot output the wrong answer two out of three times.

ChatGPT number problem. Source: ChatGPT.

In defense of o1, GPT-4o, Gemini, and Claude also failed the challenge, each generating a different response. 

Yet OpenAI boasts that the new model “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology.” So why can’t it solve a problem that a journalist armed with a pencil, paper, and no PhD cracked in a couple of minutes?

Paradigm Shift or Marketing Spin?

To probe o1’s capabilities further, CCN tested it with a range of math and logic challenges. The conclusion: the new AI model excels at the kind of algebra and calculus questions you might find in high school exams. However, it often struggles with linguistic reasoning and problems that require lateral thinking.

For example, when prompted with a classic bag of marbles puzzle, o1 found the answer within a matter of seconds. But it failed to crack the procession of ducks riddle.

In other words, OpenAI’s claim of PhD-level intelligence should be taken with a pinch of salt.

The problem lies in the way AI models are evaluated. 

In many of the most commonly used AI reasoning benchmarks, o1 has crushed the competition. For instance, it achieved 83% in a qualifying exam for the International Mathematics Olympiad that GPT-4o only scored 13% on. But AI has been beating human-level performance on these benchmarks for some time.

An Intelligent Calculator

Perhaps the best way to think of o1’s enhanced reasoning capabilities is as a sophisticated calculator. 

In the hands of someone who already understands the problems that need solving, it is an incredibly powerful tool. However, for the kinds of real-world reasoning challenges that ChatGPT users face every day, the improvement may not feel so profound.

James Morales is CCN’s blockchain and crypto policy reporter. He has been working in the news media since 2020, writing about topics such as payments, banking and financial technology. These days, he likes to explore the latest blockchain innovations and the evolving landscape of global crypto regulation. With an educational background in social anthropology and media studies, James uses his platform as a journalist to explore how new technologies work, why they matter and how they might shape our future.