Key Takeaways
Launched on Thursday, Sept. 12, OpenAI’s latest model, o1, excels at complex tasks, particularly in science and mathematics.
However, some users report that the new model often fails at basic reasoning challenges.
One widely circulated example of o1 stumbling on the kind of reasoning it is meant to excel at takes the form of a logic challenge.
In an example shared on X, the correct answer to the problem is 3-8-4-1, yet o1 confidently asserted that 3-2-1-8 met the criteria. When CCN repeated the experiment, the chatbot gave the wrong answer two times out of three.
In o1’s defense, GPT-4o, Gemini, and Claude also failed the challenge, each generating a different incorrect answer.
Yet OpenAI boasts that the new model “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology.” So why can’t it solve a problem that a journalist armed with a pencil, paper, and no PhD cracked in a couple of minutes?
To probe o1’s capabilities further, CCN tested it with a range of math and logic challenges. The conclusion: the new AI model excels at the kind of algebra and calculus questions you might find in high school exams. However, it often struggles with linguistic reasoning and problems that require lateral thinking.
For example, when prompted with a classic bag of marbles puzzle, o1 found the answer in a matter of seconds. But it failed to crack the procession of ducks riddle.
In other words, OpenAI’s claim of PhD-level intelligence should be taken with a pinch of salt.
The problem lies in the way AI models are evaluated.
In many of the most commonly used AI reasoning benchmarks, o1 has crushed the competition. For instance, it scored 83% on a qualifying exam for the International Mathematics Olympiad, a test on which GPT-4o managed only 13%. But AI models have been surpassing human-level performance on these benchmarks for some time.
Perhaps the best way to think of o1, with its enhanced reasoning capabilities, is as a sophisticated calculator.
In the hands of someone who already understands the problems that need solving, it is an incredibly powerful tool. However, for the kinds of real-world reasoning challenges that ChatGPT users face every day, the improvement may not feel so profound.