Key Takeaways
Since debuting Claude last year, Anthropic has emerged as a key player in the market for Large Language Model (LLM) services, competing with the likes of Google, Microsoft, and OpenAI.
When the firm unveiled the latest versions of its LLM on Monday, March 4, it boasted that Claude 3 outperformed both OpenAI’s GPT-4 and Google’s Gemini Ultra. But do consumers really care about abstract performance metrics?
Following a pattern established by Google’s Gemini, the latest generation of Claude comes in three model sizes: Haiku, Sonnet, and Opus.
The largest and most capable of these – Opus – is billed as a competitor to OpenAI’s GPT-4 and Google’s Gemini Ultra.
Observing that Claude 3 Opus “outperforms its peers on most of the common evaluation benchmarks for AI systems,” Anthropic said the new model “exhibits near-human levels of comprehension and fluency on complex tasks, leading the frontier of general intelligence.”
With the latest LLMs achieving scores of over 90% on some tests, the need for more sophisticated testing methodologies is becoming increasingly clear. Now that leading models are separated by increasingly small margins, are the standard industry benchmarks still the best measure of performance?
The LLM benchmarks used by Anthropic cover coding, linguistic reasoning, math problem-solving, and general knowledge.
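To make this concrete: a typical general-knowledge benchmark such as MMLU is scored as plain accuracy over multiple-choice questions. The minimal sketch below illustrates the idea; `ask_model` is a hypothetical stand-in for a call to Claude or any rival LLM, not a real API, and the sample questions are illustrative only.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style):
# each question has one correct answer, and the reported figure is
# simple accuracy across the whole question set.

def ask_model(question: str, choices: list[str]) -> str:
    """Hypothetical model call; a real harness would query Claude, GPT-4, etc."""
    return choices[0]  # placeholder answer for the sketch

def score(benchmark: list[dict]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in benchmark
    )
    return correct / len(benchmark)

sample = [
    {"question": "2 + 2 = ?", "choices": ["4", "5"], "answer": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "answer": "Paris"},
]
print(f"Accuracy: {score(sample):.0%}")
```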
But just as academic excellence alone doesn’t guarantee functional intelligence, high test scores won’t be enough to convince users to choose one model over another. In reality, the everyday experience of interacting with Claude and its LLM rivals will be a much more important determining factor.
In terms of real-world usability, factors such as the quality and availability of different integrations also come into play.
Thanks to Anthropic’s agreement with Amazon Web Services (AWS), Claude is already embedded into various AWS cloud services.
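For developers, that integration means Claude can be invoked through Amazon Bedrock using the standard AWS SDK. Below is a minimal sketch in Python with boto3; the model ID and region are assumptions based on Bedrock’s naming scheme, so check the Bedrock console for the models actually enabled in your account.

```python
import json
import boto3

# Amazon Bedrock exposes Anthropic's models through the bedrock-runtime client.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model ID assumed for illustration; verify availability in your account/region.
response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": "Summarize the main LLM benchmarks in one paragraph."}
        ],
    }),
)

# The response body is a stream; decode it to get the model's reply text.
result = json.loads(response["body"].read())
print(result["content"][0]["text"])
```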
Meanwhile, Microsoft has moved to integrate GPT-4 into its established product range, boosting current and future generations of its software with enhanced AI functionality.
In the end, these commercial relationships could prove more important than abstract performance metrics in deciding which LLM secures a role in different fields.