
OpenAI’s o1 Model Can Solve Complex Science Problems but Struggles With Basic Tasks

By James Morales

Key Takeaways

  • OpenAI has launched its latest AI model, o1.
  • The new model performs significantly better than previous generations on complex reasoning.
  • However, it still struggles with some basic reasoning tasks.

Launched on Thursday, Sept. 12, OpenAI’s latest model, o1, excels at complex tasks, particularly in science and mathematics.

However, some user reviews say the new model often fails at basic reasoning challenges. 

O1 Fails To Crack the Code

One example of o1 failing at the kind of reasoning it is meant to excel at has circulated online in the form of a logic challenge.

In an example shared on X, the correct answer to the problem is 3-8-4-1, yet o1 confidently asserted to the user that 3-2-1-8 meets the criteria. When CCN repeated the experiment, the chatbot output the wrong answer two out of three times.

ChatGPT's incorrect answer to the number problem. Source: ChatGPT.

In o1's defense, GPT-4o, Gemini, and Claude also failed the challenge, each generating a different response.

Yet OpenAI boasts that the new model “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology.” So why can’t it solve a problem that took a journalist with a pencil, paper, and no PhD just a couple of minutes?

Paradigm Shift or Marketing Spin?

To probe o1’s capabilities further, CCN tested it with a range of math and logic challenges. The conclusion: the new AI model excels at the kind of algebra and calculus questions you might find in high school exams. However, it often struggles with linguistic reasoning and problems that require lateral thinking.

For example, when prompted with a classic bag of marbles puzzle, o1 found the answer within a matter of seconds. But it failed to crack the procession of ducks riddle.

In other words, OpenAI’s claim of PhD-level intelligence should be taken with a pinch of salt.

The problem lies in the way AI models are evaluated. 

In many of the most commonly used AI reasoning benchmarks, o1 has crushed the competition. For instance, it scored 83% on a qualifying exam for the International Mathematics Olympiad, an exam on which GPT-4o managed just 13%. But AI has been beating human-level performance on these benchmarks for some time.

An Intelligent Calculator

Perhaps the best way to think of o1’s enhanced reasoning capabilities is as a sophisticated calculator. 

In the hands of someone who already understands the problems that need solving, it is an incredibly powerful tool. However, for the kinds of real-world reasoning challenges that ChatGPT users face every day, the improvement may not feel so profound.
