
AI Minecraft Challenge: Measuring Artificial Intelligence Through Gameplay

James Morales
Last Updated February 16, 2024 4:27 PM

Key Takeaways

  • Researchers have developed a new way of testing AI using Minecraft.
  • Current AI benchmarks are limited when it comes to testing problem-solving ability.
  • However, the creators of MinePlanner argue that future models will need to deal with ‘messy’ problems.

School teachers know play is often the best way to get students to think for themselves. From kindergarten onwards, educators use games to assess original thinking and problem-solving skills. Now, AI developers want to apply a similar strategy.

Researchers at the University of the Witwatersrand in South Africa recently proposed a new benchmark test that uses Minecraft to assess AI performance. Named MinePlanner, the currently unpublished benchmark anticipates the next stage of AI development, in which models will be expected to do more than understand a question and provide a reasonable answer based on their training data. 

Minecraft as an Educational Tool

Speaking to New Scientist, Steven James, lead researcher on the MinePlanner project, observed that future AI models will need to tackle messy problems. He hopes to test how well they handle unfamiliar scenarios by getting them to play Minecraft.

MinePlanner consists of 15 construction problems, each with an easy, medium, and hard setting, for a total of 45 tasks. To complete each task, the AI may need to take multiple steps. For example, to place a block at a certain height, it may first need to build stairs.
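The benchmark itself has not yet been published, so its actual format is unknown, but the structure described above could be sketched roughly as follows. Every name and field in this snippet is an illustrative assumption, not MinePlanner's real task definition.

```python
from dataclasses import dataclass

# Hypothetical sketch of how a MinePlanner-style construction task might be
# represented; the benchmark is unpublished, so these names are illustrative.
@dataclass
class ConstructionTask:
    name: str           # e.g. "build a tower"
    difficulty: str     # "easy", "medium", or "hard"
    goal_blocks: list   # (x, y, z, block_type) targets the agent must place

PROBLEMS = 15
DIFFICULTIES = ("easy", "medium", "hard")

# 15 construction problems x 3 difficulty settings = 45 tasks in total
tasks = [
    ConstructionTask(name=f"problem_{i}", difficulty=d, goal_blocks=[])
    for i in range(PROBLEMS)
    for d in DIFFICULTIES
]
assert len(tasks) == 45
```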

For the same reason that Minecraft has become a popular educational tool for teaching children 3-dimensional problem-solving skills, James thinks it is an ideal test environment for AI models that will be deployed in unpredictable real-world scenarios.

Testing For General Intelligence

To oversimplify a complex process, AI development involves training models on massive data sets and then testing their ability to answer questions and solve problems. To improve the scores, developers then adjust their model weights, following the logic that a higher score on key benchmarks will lead to better performance in real-world applications like chatbots.

The problem with the standard approach, however, is that it doesn’t test AI models’ ability to deal with new information, only how well they can recall information from their training data. 
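To see why that matters, consider a toy sketch (not any real benchmark or model): a lookup-table "model" that has simply memorized its training questions scores perfectly on them, yet only hits the chance rate on anything it has never seen.

```python
import random

# Toy illustration: a "model" that memorizes its training data answers seen
# questions perfectly but falls back to guessing on anything new, so a test
# drawn from the training set wildly overstates its ability.
training_data = {f"question_{i}": random.choice("ABCD") for i in range(100)}
unseen_data = {f"new_question_{i}": random.choice("ABCD") for i in range(100)}

def memorizing_model(question):
    return training_data.get(question, random.choice("ABCD"))

def score(dataset):
    correct = sum(memorizing_model(q) == a for q, a in dataset.items())
    return correct / len(dataset)

print(f"seen questions:   {score(training_data):.0%}")  # 100%
print(f"unseen questions: {score(unseen_data):.0%}")    # ~25%, i.e. chance
```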

The Need For Better Benchmarks

Current AI benchmarks are the equivalent of making students sit a test consisting only of questions from past papers. Someone with a perfect photographic memory could answer every question correctly but still not understand why an answer is correct or how to replicate the result.

While AI has advanced in leaps and bounds in a short space of time, contemporary models still perform badly in areas where they can’t simply regurgitate training data. When tests do assess AI’s capacity for original thinking, the results speak volumes.

The authors of the industry-standard Massive Multitask Language Understanding (MMLU) benchmark found that AI models fared worst on calculation-heavy subjects such as physics and mathematics, and on subjects tied to human values such as law and morality. For example, when OpenAI’s GPT-3 was given multiple-choice questions in a few-shot MMLU test, it scored around 30% on elementary mathematics, barely above the 25% baseline that would result from random guessing.
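For context on that baseline: with four answer choices, random guessing converges on 25% accuracy, which is why a roughly 30% score is barely better than chance. The snippet below is purely illustrative and is not the actual MMLU evaluation harness.

```python
import random

# Illustrative only: scoring a 4-option multiple-choice test like MMLU.
# With 4 choices, a random guesser expects 1/4 = 25% accuracy.
def accuracy(predictions, answers):
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

answers = [random.choice("ABCD") for _ in range(1000)]  # hypothetical answer key
guesses = [random.choice("ABCD") for _ in range(1000)]  # random-guess baseline

print(f"random baseline: {accuracy(guesses, answers):.1%}")  # ~25%
```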

Of course, AI has evolved since then. But as models improve, new testing methodologies could be required.

If the idea of assessing AI performance with video games seems unusually whimsical for the technology, that’s because whimsy belongs to a higher order of intelligence than what existing models have demonstrated. Consider, for example, that while play is observed across the animal kingdom, only mammals and some birds engage in more complex games or puzzles, and the tasks proposed by James et al. are the kind that would challenge even the creativity of a young child.
