People love Minecraft and like to see who can create the most beautiful builds there. Now, if AI models want to be like us, they need to be creative too. So, this new website lets you test and compare which AI models are good at Minecraft.
Minecraft Benchmarking for AI Models
MC-Bench, or Minecraft Benchmarking, is a website created by 12th-grader Aditya Singh. On the website (mcbench.ai), you can compare how well two AI models generate innovative Minecraft creations from the same prompt.
MC-Bench serves as a benchmarking platform specifically designed to evaluate AI models’ capabilities in generating Minecraft builds.
Here’s how it works: when you visit the website, you are shown two creations and you have to vote for which one looks better. For example, here are two tables made by two different AI models:
You have to vote for one of them but there is also a “Tie” option if you think both are equally good.
After you vote, the site reveals the names of the AI models:
If you are a Minecraft player, this is a fun game you must try once. Here you can use your gaming skills to judge the AI models.
Here is another example of building Frosty the Snowman:
The builds also include Earth from space:
Herein, GPT 4.5 – Preview (2025-02-27) uses a simple Perlin Noise approximation to "Build our Earth as a sphere viewed from space, as detailed and realistic as possible." Share link below. cc: @OpenAI pic.twitter.com/8dYfl5GJxi
- Minecraft Benchmark (@_mcbench), March 14, 2025
Even unicorns:
Sometimes a model produces an elegant algorithm for placing blocks. Other times it does the calculations "in its head." Herein, GPT 4.5 – Preview (2025-02-27) just lays down the blocks to create "A fancy colorful Unicorn." Share link below. pic.twitter.com/eVUOtwv3hZ
- Minecraft Benchmark (@_mcbench), March 13, 2025
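To make the contrast in that post concrete, here is a minimal Python sketch of the two styles. It is only an illustration: the article does not describe the build interface MC-Bench gives to models, so the `(x, y, z, block)` tuple format and the function name `sphere_blocks` are assumptions made up for this example.

```python
def sphere_blocks(radius, block="stone"):
    """Algorithmic style: compute the shell of a sphere centered at the origin.

    Returns a list of hypothetical (x, y, z, block) placements. A position is
    kept when its squared distance from the center lies in the outermost
    one-block-thick layer. Intended for radius >= 1.
    """
    placements = []
    inner_sq = (radius - 1) ** 2
    outer_sq = radius ** 2
    for x in range(-radius, radius + 1):
        for y in range(-radius, radius + 1):
            for z in range(-radius, radius + 1):
                dist_sq = x * x + y * y + z * z
                if inner_sq < dist_sq <= outer_sq:
                    placements.append((x, y, z, block))
    return placements


# "In its head" style: every block is written out by hand, no loop or formula.
hand_placed_blocks = [
    (0, 0, 0, "white_wool"),
    (0, 1, 0, "white_wool"),
    (0, 2, 0, "pink_wool"),
]

if __name__ == "__main__":
    shell = sphere_blocks(3)
    print(len(shell), "blocks computed;", len(hand_placed_blocks), "blocks hand-placed")
```

The first style scales to large, regular shapes such as a planet, while the second is closer to what the post describes for the unicorn, where the model simply lists the blocks.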
Overall, users vote on the best Minecraft build before discovering which AI created it. This means it is a human preference leaderboard, just like LMArena.
Minecraft has achieved remarkable success since its release in 2009, becoming the best-selling video game of all time. As of October 2023, it has sold over 300 million copies worldwide. That’s why the creator of this website chose Minecraft for benchmarking AI models. He explained the choice to TechCrunch:
“Minecraft allows people to see the progress (of AI development) much more easily. People are used to Minecraft, used to the look and the vibe.”
-Aditya Singh
Traditional AI benchmarks typically use complex metrics and programming challenges that are difficult for the average person to understand. While valuable for researchers, these benchmarks often lack accessibility.
There is also a leaderboard available on the website. The #1 spot is currently held by Anthropic’s Claude 3.7 Sonnet, which had a win rate of 86% the last time I checked. The runner-up is also an Anthropic model: Claude 3.5 Sonnet. OpenAI’s GPT-4.5 Preview is in third place.
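The article does not say how the site computes its win rates, so the following is only a rough Python sketch of one common approach for a vote-based leaderboard. The vote format, the placeholder model names, and the choice to count a tie as half a win for each side are all assumptions for illustration, not MC-Bench's actual method.

```python
from collections import defaultdict

VALID_OUTCOMES = {"a", "b", "tie"}


def win_rates(votes, tie_weight=0.5):
    """Compute a per-model win rate from pairwise votes.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    "a", "b", or "tie". A tie credits both models with tie_weight
    (an assumption; the real site may treat ties differently).
    """
    score = defaultdict(float)
    played = defaultdict(int)
    for model_a, model_b, outcome in votes:
        if outcome not in VALID_OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        played[model_a] += 1
        played[model_b] += 1
        if outcome == "a":
            score[model_a] += 1.0
        elif outcome == "b":
            score[model_b] += 1.0
        else:
            score[model_a] += tie_weight
            score[model_b] += tie_weight
    return {model: score[model] / played[model] for model in played}


def leaderboard(votes):
    """Return (model, win_rate) pairs sorted from best to worst."""
    return sorted(win_rates(votes).items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder votes with made-up model names, not real MC-Bench data.
    sample_votes = [
        ("model-x", "model-y", "a"),
        ("model-x", "model-z", "a"),
        ("model-y", "model-z", "tie"),
        ("model-y", "model-x", "b"),
    ]
    for model, rate in leaderboard(sample_votes):
        print(f"{model}: {rate:.0%}")
```

A plain win rate like this ignores how strong each opponent was. Arena-style leaderboards often use a rating system to account for that, and the article does not say which approach MC-Bench takes.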
According to the creator, the leaderboard matches his own experience with these models, which suggests that MC-Bench rankings track real differences in model quality.
People online are also finding it enjoyable. Some are calling it the “coolest benchmark ever”.
As of 15 March 2025, over 10,000 individual build samples have been voted on. There are still 20,000 builds yet to be evaluated, according to the latest update from the project’s X account.
Minecraft’s open-ended nature makes it an ideal testing ground for AI creativity. Benchmarking AI models in this environment helps determine how well AI can design within Minecraft’s constraints.
But this is not the first time games have been used for AI research. Classic games like Super Mario Bros, Street Fighter, and Pokemon Red have also been used recently to test LLMs.
Takeaways
There are many ways to test AI models, but this is the most interesting method I have seen so far. It also brings some fun to a technical industry, which might encourage young minds to get started in the AI world.