AI Outpaces Human Abilities: Stanford Report Urges Development of New Evaluation Benchmarks 

Last Updated April 16, 2024 3:18 PM
James Morales

Key Takeaways

  • The 2024 edition of Stanford University’s annual Artificial Intelligence Index found that AI has surpassed human capabilities in a number of areas.
  • The most advanced AI models have now surpassed the human baseline on benchmarks for image classification, basic reading comprehension and visual instruction following.
  • However, AI is still behind humans in math problem-solving and visual commonsense reasoning.
  • Now, researchers have added a new challenge to that list – visual instruction following.

The 2024 edition of Stanford University’s annual Artificial Intelligence Index has made headlines for confirming what many of us already suspected: AI is now more capable of performing certain intellectual tasks than the average human.

With the latest AI systems outperforming the human baseline in critical areas such as image classification, visual reasoning, and English understanding, the report highlighted the need for new benchmarks to assess the capabilities of future models.

AI Surpasses Human Capabilities

Stanford’s AI Index uses a set of standardized benchmarks to measure the performance of AI systems in 9 separate areas.

For example, the SuperGLUE benchmark is used to assess language understanding. Meanwhile, mathematical problem-solving ability is measured using MATH – a set of 12,500 problems from high school math competitions.

Source: Stanford 2024 AI Index Report.

The report notes that over the years, AI has surpassed human baselines on a handful of benchmarks: image classification in 2015, basic reading comprehension in 2017, visual reasoning in 2020, and natural language inference in 2021.

Now, researchers have added a new challenge to that list – visual instruction following as measured by VisIT-BENCH.

As AI has advanced, researchers have developed new benchmarks and retired old ones. But as the latest AI Index notes, benchmarking is now shifting away from computerized rankings to incorporate more human evaluations like Chatbot Arena.

New Directions in AI Evaluation

Just as universities tend not to put post-graduate students through examinations, increasingly sophisticated AI agents are outgrowing traditional benchmarking methods.

Chatbot Arena acknowledges that controlled test environments don’t always mirror the real world. Instead, it ranks models according to crowdsourced user evaluations.

Interestingly, this approach has demonstrated that high test scores don’t directly translate into a better user experience. For example, even though it scores lower on several traditional AI benchmarks, Claude 3 has now overtaken GPT-4 on Chatbot Arena.
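
To give a rough sense of how crowdsourced head-to-head votes can be turned into a ranking, here is a minimal Elo-style sketch in Python. It is an illustration only and not Chatbot Arena's actual implementation: the model names, the votes, the starting rating and the K-factor are all hypothetical.

```python
from collections import defaultdict

K = 32          # assumed update step size (hypothetical choice)
START = 1000.0  # assumed starting rating for every model


def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def rank_models(votes):
    """Rank models from an iterable of (winner, loser) user votes."""
    ratings = defaultdict(lambda: START)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        # An unexpected win moves the ratings more than an expected one.
        delta = K * (1.0 - exp_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Made-up votes, for illustration only.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
leaderboard = rank_models(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because the ranking depends only on which answer users preferred, a model can climb a leaderboard like this even if it scores lower on fixed test sets.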

In yet another take on AI evaluation, researchers hope to approximate challenges AI agents may face in the wild using video games. 

One novel benchmark proposed by researchers at the University of the Witwatersrand uses Minecraft to assess AI performance. 

Meanwhile, the team behind Google’s Scalable Instructable Multiworld Agent (SIMA) used games including Goat Simulator 3, No Man’s Sky and Valheim to train and evaluate their agent’s performance.

However, the 2024 AI Index highlights one important area of AI benchmarking where more work is needed.

Lack of Standardized Benchmarks for AI Safety

From sexually explicit AI deepfakes to incidents involving self-driving cars, AI safety concerns are rising. 

To measure the safety of different AI models, benchmarks have been developed that assess the bias, truthfulness and toxicity of outputs.

Yet, the 2024 AI Index found that there is a “significant lack of standardization” in the way AI developers assess systems for safety.

AI safety benchmarks
Source: Stanford 2024 AI Index Report

“Leading developers, including OpenAI, Google, and Anthropic, primarily test their models against different responsible AI benchmarks,” the report notes. “This practice complicates efforts to systematically compare the risks and limitations of top AI models.”

For example, the report noted that none of the major AI developers reported results for all 5 of the most commonly used responsibility benchmarks, with Mistral not using a single one.
