We know AI can play video games, but now it can play like humans and not just to beat high scores! Google’s DeepMind team unveiled the world’s first AI Gaming Agent called SIMA. So, let’s look further into its amazing features!
Highlights:
- Google’s DeepMind unveils SIMA, short for Scalable Instructable Multiworld Agent.
- This AI is trained in various gaming environments with the help of natural language instructions from users.
- Trained and tested on nine commercial video games and four research environments, with the help of several pre-trained models.
What is Google’s SIMA?
SIMA is the world’s first AI gaming agent designed by Google and trained across several virtual gaming environments. DeepMind describes it as a generalist, instructable game-playing agent.
SIMA stands for Scalable Instructable Multiworld Agent. It can work with just natural-language instructions to play 3D video games, just like humans.
The goal is not to win the game with the best possible score, which other AI systems can do by accessing the game’s source code. SIMA receives only image input from the screen and instructions from the user, and it plays the video game using nothing but keyboard and mouse controls.
Introducing SIMA: the first generalist AI agent to follow natural-language instructions in a broad range of 3D virtual environments and video games. 🕹️
It can complete tasks similar to a human, and outperforms an agent trained in just one setting. 🧵 https://t.co/qz3IxzUpto pic.twitter.com/02Q6AkW4uq
— Google DeepMind (@GoogleDeepMind) March 13, 2024
SIMA can follow instructions across a variety of gaming environments and visual settings, learning as it goes. It can use these virtual environments as a reference to expand its knowledge base and generalize further.
This update comes days after Google launched its Genie AI that can create entire playable virtual environments.
How was SIMA trained?
This AI agent was exposed to a wide range of virtual gaming environments thanks to DeepMind’s partnerships with several game developers.
Google worked with eight different game companies to train and test SIMA on nine distinct video games, including Hello Games’ No Man’s Sky, Tuxedo Labs’ Teardown, Valheim, and Wobbly Life.
Every game in SIMA’s library introduces players to a brand-new interactive environment and a variety of new skills to pick up, such as basic menu navigation, resource mining, spacecraft navigation, and helmet construction.
In addition, four research environments were used to train SIMA. According to the research paper, the team chose 3D embodied environments that allow a wide variety of unrestricted interactions, because rich and deep linguistic interactions are possible in these kinds of environments.
One of the chosen environments named Construction Labs was built with Unity.
Construction Labs provides agents with a brand-new research setting in which they must construct unique objects and sculptures out of interlocking construction blocks, such as dynamic devices, ramps to climb, and bridges to cross. Its main focus is cognitive skills such as object manipulation and an intuitive grasp of the physical world.
The other three environments, namely Playhouse, ProcTHOR, and WorldLab, were used for graphical interactions, data collection, and physics simulation, respectively.
Pre-Trained Models
In addition to components trained from scratch, SIMA’s agent architecture includes several pre-trained models, such as Phenaki, a video prediction model, and SPARC, a model trained on fine-grained image-text alignment. It also includes a text encoder for pre-processing and encoding text-based input.
By combining these pre-trained models with fine-tuning and from-scratch training, the agent can leverage internet-scale pretraining while remaining specific to the settings and control tasks it faces.
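To make that pattern concrete, here is a minimal, hypothetical PyTorch sketch of the idea: frozen pre-trained encoders feeding a fusion module trained from scratch. The class names and dimensions are illustrative assumptions, not SIMA’s actual implementation, and the encoders are simple stand-ins since Phenaki and SPARC are not available as off-the-shelf libraries.

```python
# Sketch only: frozen pre-trained encoders + a fusion module trained from scratch.
import torch
import torch.nn as nn

class PretrainedVisionEncoder(nn.Module):
    """Placeholder for a pre-trained image/video encoder (a SPARC/Phenaki-like stand-in)."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, out_dim),
        )

    def forward(self, frames):            # frames: (B, 3, H, W)
        return self.backbone(frames)      # (B, out_dim)

class GeneralistAgentSketch(nn.Module):
    def __init__(self, vocab_size=10_000, text_dim=512, vision_dim=512, state_dim=512):
        super().__init__()
        self.vision = PretrainedVisionEncoder(vision_dim)
        for p in self.vision.parameters():              # keep pre-trained weights frozen
            p.requires_grad = False
        self.text = nn.EmbeddingBag(vocab_size, text_dim)   # stand-in text encoder
        self.fusion = nn.Sequential(                          # trained from scratch
            nn.Linear(vision_dim + text_dim, state_dim), nn.ReLU(),
            nn.Linear(state_dim, state_dim),
        )

    def forward(self, frames, token_ids):                # token_ids: (B, L) instruction tokens
        v = self.vision(frames)
        t = self.text(token_ids)
        return self.fusion(torch.cat([v, t], dim=-1))    # fused state representation
```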
Transformers
To create a state representation, SIMA’s agent makes use of the encoded language instruction, a Transformer-XL that attends to prior memory states, and trained-from-scratch transformers that integrate the outputs of the various pre-trained vision components.
The resulting state representation is fed into a policy network that generates keyboard-and-mouse actions in sequences of eight. The agent is trained with behavioural cloning, plus an auxiliary objective of goal-completion prediction.
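As a rough illustration of that training signal, the sketch below pairs a policy head that predicts an eight-step chunk of keyboard and mouse actions with a behavioural-cloning loss and a goal-completion head. All dimensions, head names, and the action encoding are assumptions made for this example, not details taken from the SIMA paper.

```python
# Sketch only: behavioural cloning over 8-step keyboard/mouse chunks
# plus an auxiliary goal-completion prediction objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_KEYS, SEQ_LEN = 512, 64, 8   # predict 8 future actions at a time

class PolicyHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.key_logits = nn.Linear(STATE_DIM, SEQ_LEN * NUM_KEYS)  # discrete key presses
        self.mouse = nn.Linear(STATE_DIM, SEQ_LEN * 2)              # continuous (dx, dy) moves
        self.goal_done = nn.Linear(STATE_DIM, 1)                    # auxiliary: is the goal complete?

    def forward(self, state):                                       # state: (B, STATE_DIM)
        B = state.shape[0]
        return (self.key_logits(state).view(B, SEQ_LEN, NUM_KEYS),
                self.mouse(state).view(B, SEQ_LEN, 2),
                self.goal_done(state).squeeze(-1))

def bc_loss(policy, state, expert_keys, expert_mouse, goal_label):
    """Behavioural cloning: imitate human keyboard/mouse actions,
    with goal-completion prediction as an extra objective.
    expert_keys: (B, SEQ_LEN) long key ids; expert_mouse: (B, SEQ_LEN, 2) floats."""
    key_logits, mouse_pred, done_logit = policy(state)
    loss_keys = F.cross_entropy(key_logits.reshape(-1, NUM_KEYS), expert_keys.reshape(-1))
    loss_mouse = F.mse_loss(mouse_pred, expert_mouse)
    loss_goal = F.binary_cross_entropy_with_logits(done_logit, goal_label)
    return loss_keys + loss_mouse + loss_goal
```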
Classifier Free Guidance
When the trained agent is run in an environment, Classifier-Free Guidance (CFG) is also used to strengthen the agent’s conditioning on language. Although CFG was first proposed to improve text-conditioning in diffusion models, it has also shown promise in language models and language-conditioned agents.
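In essence, CFG runs the policy twice, with and without the instruction, and amplifies the difference between the two predictions. A minimal sketch of that calculation, with an assumed guidance scale and toy logits, looks like this:

```python
# Sketch only: classifier-free guidance applied to action logits.
import torch

def cfg_logits(cond_logits, uncond_logits, guidance_scale=2.0):
    """Push the action distribution toward instruction-dependent behaviour:
    logits = uncond + scale * (cond - uncond)."""
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy example: action logits with and without the language instruction.
cond = torch.tensor([1.0, 0.2, -0.5])     # policy run with the instruction
uncond = torch.tensor([0.4, 0.3, -0.1])   # policy run with the instruction masked out
print(cfg_logits(cond, uncond))           # amplified toward the instruction-driven action
```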
SIMA’s Workflow: How Does the Agent Operate?
SIMA can recognize and comprehend a range of settings before acting to accomplish a given task. It consists of a video model that forecasts the next scene on the screen and a model for accurate image-language mapping.
Using training data in 3D settings, Google improved these models.
This is highly efficient when it comes to data collection. Compared to other traditional AI models, SIMA says goodbye to the need for a game’s source code or APIs.
It just needs two inputs: the user’s straightforward, natural language instructions and the pictures displayed on the screen. To carry out these commands, SIMA controls the game’s main character via keyboard and mouse outputs.
Because of its straightforward, widely-used interface, SIMA can theoretically communicate with any virtual environment.
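The resulting control loop is therefore very simple. The sketch below is a schematic of that interface, not SIMA’s actual code: `capture_screen`, `send_key`, and `move_mouse` are hypothetical stand-ins for whatever OS-level input/output layer a given setup uses, and `agent.act` is assumed to return an eight-action chunk of key presses and mouse deltas.

```python
# Sketch only: pixels + instruction in, keyboard and mouse actions out.
import time

def run_agent(agent, instruction, capture_screen, send_key, move_mouse, steps=100):
    for _ in range(steps):
        frame = capture_screen()                       # screen pixels only -- no game source code or API
        keys, mouse_deltas = agent.act(frame, instruction)
        for key, (dx, dy) in zip(keys, mouse_deltas):  # play out the predicted action chunk
            send_key(key)
            move_mouse(dx, dy)
            time.sleep(0.05)                           # roughly match the game's tick rate
```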
Further, this agent translates linguistic commands and visual observations into keyboard and mouse movements. When the user provides clear directions, it breaks activities down into simpler subtasks that can be reused in entirely new situations.
Once data has been collected from user instructions and gameplay, the agent is trained on it using the pre-trained models, research environments, transformers, and CFG described above. This is what makes it capable and interactive across several environments.
Performance Compared to Specialized Agents
When Google DeepMind assessed SIMA agents that were trained on a selection of nine 3D games from their library, they considerably outperformed all specialized agents that were trained exclusively on those games.
In the test, DeepMind evaluated agents on their ability to follow instructions and complete nearly 1,500 unique in-game tasks, in part using human judges. The environment-specialized agents served as the baseline against which three types of generalist SIMA agents, trained across multiple environments, were compared.
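For readers curious how such a comparison can be tallied, here is a trivial, hypothetical sketch that aggregates per-task pass/fail judgements into success rates per agent and environment. The data layout is invented purely for illustration; the SIMA report presents its own per-environment results.

```python
# Sketch only: aggregate per-task success judgements into success rates.
from collections import defaultdict

def success_rates(results):
    """results: iterable of (agent_name, environment, task_succeeded: bool)."""
    totals, wins = defaultdict(int), defaultdict(int)
    for agent, env, ok in results:
        totals[(agent, env)] += 1
        wins[(agent, env)] += int(ok)
    return {key: wins[key] / totals[key] for key in totals}

# Example: compare a generalist agent against an environment specialist.
rates = success_rates([
    ("generalist", "no_mans_sky", True), ("generalist", "no_mans_sky", False),
    ("specialist", "no_mans_sky", True), ("specialist", "no_mans_sky", True),
])
print(rates)
```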
SIMA’s Research Update
Google DeepMind said that SIMA is still in the development phase and requires more research before it can reach human-level performance. In the official announcement, they stated:
“SIMA’s results show the potential to develop a new wave of generalist, language-driven AI agents. This is early-stage research and we look forward to further building on SIMA across more training environments and incorporating more capable models.”
They plan to expose the gaming agent to more training worlds and improve its abilities in the future. The broader goal is to build general AI agents that can carry out many different tasks.
Conclusion
SIMA is an exceptional advancement in the world of AI agents. The idea of training AI through video games has become a reality, and this tool could soon put diverse virtual environments at developers’ fingertips!