At its biggest developer conference, Google I/O, Google unveiled a slate of impressive products. Following OpenAI’s spring update on Monday, where the company introduced its real-time multimodal model, GPT-4o, Google has announced its own competitor: Project Astra.
Google DeepMind just announced Project Astra.
It’s a universal AI agent that can see AND hear what you do live in real-time, and take action on your behalf.
Google just made it very clear that it’s transforming Gemini from a chatbot into a personal AI agent.
Public access… https://t.co/wdj7dWeucn
— Rowan Cheung (@rowancheung) May 14, 2024
Astra stands for Advanced Seeing and Talking Responsive Agent. It is built as a universal agent, designed to be useful in real life. To be truly useful, an agent needs to understand and respond to the complex and dynamic world just like people do — and take in and remember what it sees and hears to understand the context and take action.
Astra uses the camera on a user’s device to act as a real-time assistant. It would be as if your assistant were walking right next to you!
According to Demis Hassabis, the CEO and cofounder of Google DeepMind, their long-standing ambition has been to develop a versatile artificial agent capable of seamlessly integrating into daily life.
“We’ve always wanted to build a universal agent that will be useful in everyday life,” Hassabis stated.
Hassabis further elaborated, “Imagine agents that can see and hear what we do, better understand the context we’re in, and respond quickly in conversation, making the pace and quality of interaction feel much more natural.” Ultimately, Astra aims to facilitate natural and enriched interactions with artificial intelligence.
Astra will be accessible through the Gemini app later this year.
The MIT Technology Review called Astra the “AI for everything”. So what are Astra’s capabilities? Google says that users can converse with Astra as they would with any other person, naturally and with virtually no latency.
It can not only identify the objects visible on the screen but also remember them for future reference! That is like having an assistant with a photographic memory. This is eerily reminiscent of a particular episode of Black Mirror where the characters can record their memories and watch them back.
The agents that power Project Astra are built on Google’s Gemini model along with other task-specific models. They respond quickly by continuously processing video and speech input as it streams in.
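Google has not released a public API for Astra’s continuous streaming pipeline, but the basic idea of asking a Gemini model about a camera frame can be sketched with the publicly available google-generativeai Python SDK. The snippet below is only an illustrative single-turn approximation, not Astra’s actual implementation; the model name, image file, and API key placeholder are assumptions.

```python
# Minimal sketch: send one camera frame and a question to a Gemini model.
# This approximates a single turn of an Astra-style interaction; the real
# system continuously streams video and audio rather than single frames.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key (assumption)

model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

frame = Image.open("camera_frame.jpg")  # a frame grabbed from the device camera
question = "What do you see in this frame, and what neighborhood might I be in?"

# generate_content accepts a mixed list of images and text as the prompt
response = model.generate_content([frame, question])
print(response.text)
```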
Google DeepMind posted a video of its live demonstration.
During the demo, a user directed a smartphone camera and smart glasses towards various objects, prompting Astra to describe what it saw. When the individual pointed the device out the window and asked, “What neighborhood do you think I’m in?” the AI system correctly identified King’s Cross, London, which happens to be the location of Google DeepMind’s headquarters. Remarkably, Astra was also able to recall that the person’s glasses were sitting on a desk, having registered them earlier in the interaction.
When the phone’s camera was pointed at a table, and the user inquired about the source of sound, Astra pinpointed a computer speaker. Subsequently, the user pointed at the top portion of the speaker with their finger, prompting Astra to identify that specific component. The AI system accurately recognized and responded that the indicated part was a tweeter.
From there, Astra composed a limerick (a short, humorous poem) about some colored pencils and explained what a section of code does when the camera was pointed at a computer monitor.
The most incredible part of the demo came when the user switched to smart glasses. Astra answered questions on the fly, commenting on whatever was visible through the glasses.
Does it measure up to GPT-4o?
The major advantage Astra has over GPT-4o is its ready integration with smart glasses. Other than that, the two seem quite similar in functionality.
Astra appears to have slightly higher latency than GPT-4o, and the voice assistant in the demo showed less emotional range than ChatGPT’s. GPT-4o can likewise use a device’s built-in camera to perceive the world around it, and its voice has a more human-like expressiveness that Astra does not seem to match: it can simulate emotions such as surprise and even flirtatiousness.
Who’s the Winner?
We can’t say yet. Neither of these vision models is available for everyone to test, so all we have to compare are the demo videos. OpenAI posted a large number of demos showcasing the model’s different capabilities, while Google has released just one video, so we will have to wait and watch to find out which model comes out on top!