Key Takeaways
Since debuting Claude last year, Anthropic has emerged as a key player in the market for Large Language Model (LLM) services, competing with the likes of Google, Microsoft, and OpenAI.
When the firm unveiled the latest versions of its LLM on Monday, March 4, it boasted that Claude 3 outperformed both OpenAI’s GPT-4 and Google’s Gemini Ultra. But do consumers really care about abstract performance metrics?
Following a pattern established by Google’s Gemini, the latest generation of Claude comes in three model sizes: Haiku, Sonnet, and Opus.
The largest and most capable of these – Opus – is billed as a competitor to OpenAI’s GPT-4 and Google’s Gemini Ultra.
Observing that Claude 3 Opus “outperforms its peers on most of the common evaluation benchmarks for AI systems,” Anthropic said the new model “exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.”
With the latest LLMs achieving scores of over 90% on some tests, the need for more sophisticated testing methodologies is becoming increasingly clear. Now that leading models are separated by increasingly small margins, are the standard industry benchmarks still the best measure of performance?
The LLM benchmarks used by Anthropic cover coding, linguistic reasoning, math problem-solving, and general knowledge.
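To make this concrete: a typical general-knowledge benchmark such as MMLU is scored as plain accuracy over multiple-choice questions. The minimal sketch below illustrates the idea; `ask_model` is a hypothetical stand-in for a call to Claude or any rival LLM, not a real API, and the sample questions are illustrative only.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style):
# each question has one correct answer, and the reported figure is
# simple accuracy across the whole question set.

def ask_model(question: str, choices: list[str]) -> str:
    """Hypothetical model call; a real harness would query Claude, GPT-4, etc."""
    return choices[0]  # placeholder answer for the sketch

def score(benchmark: list[dict]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in benchmark
    )
    return correct / len(benchmark)

sample = [
    {"question": "2 + 2 = ?", "choices": ["4", "5"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "answer": "Paris"},
]
print(f"Accuracy: {score(sample):.0%}")
```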
But just as academic excellence alone doesn’t guarantee functional intelligence, high test scores won’t be enough to convince users to choose one model over another. In reality, the everyday experience of interacting with Claude and its LLM rivals will be a much more important determining factor.
In terms of real-world usability, factors such as the quality and availability of different integrations also come into play.
Thanks to Anthropic’s agreement with Amazon Web Services (AWS), Claude is already embedded into various AWS cloud services.
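For developers, that integration means Claude can be invoked through Amazon Bedrock using the standard AWS SDK. Below is a minimal sketch in Python with boto3; the model ID and region are assumptions based on Bedrock’s naming scheme, so check the Bedrock console for the models actually enabled in your account.

```python
import json
import boto3

# Amazon Bedrock exposes Anthropic's models through the bedrock-runtime client.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID assumed for illustration; verify availability in your account/region.
response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": "Summarize the main LLM benchmarks in one paragraph."}
        ],
    }),
)

# The response body is a stream; decode it to get the model's reply text.
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```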
Meanwhile, Microsoft has moved to integrate GPT-4 into its established product range, boosting current and future generations of its software with enhanced AI functionality.
In the end, these commercial relationships could prove more important than abstract performance metrics in deciding which LLM secures a role in different fields.