A few days ago, at Google I/O, the company announced Veo, its most advanced text-to-video model yet. The model and its impressive demos immediately sparked comparisons with OpenAI's viral video generation model, Sora.
Google also said that Veo is currently available only as a private preview to a select group of creators. There is, however, a waitlist you can sign up for ahead of the official rollout.
Is Veo better than OpenAI’s Sora?
Opinions differ on whether this new model is truly better than Sora, which set the AI community abuzz with its hyper-realistic visuals and was among the first text-to-video models to achieve that level of realism.
There is a major difference in how Google and OpenAI are approaching the video generation space. Google has created a waitlist to test the model and intends to release it for beta testing soon.
OpenAI, by contrast, has been courting Hollywood, and there has been heavy speculation about Sora being licensed to a movie studio. There has been no official statement on whether Sora will be open to the public soon; the large amount of compute each generation requires could be a significant obstacle to a wide release.
With neither model released to the public, it is difficult to make a true comparison of what each can do. However, we still have the demos. Yesterday, Google posted a few new videos on X showing off Veo's capabilities.
Introducing Veo: our most capable generative video model. 🎥
It can create high-quality, 1080p clips that can go beyond 60 seconds.
From photorealism to surrealism and animation, it can tackle a range of cinematic styles. 🧵 #GoogleIO pic.twitter.com/6zEuYRAHpH
— Google DeepMind (@GoogleDeepMind) May 14, 2024
Google says the model's emphasis is on quality and camera control. Veo generates videos that can extend beyond a minute, in a diverse range of cinematic styles.
“Veo is our most capable video generation model to date. It generates high-quality, 1080p resolution videos that can go beyond a minute, in a wide range of cinematic and visual styles. It accurately captures the nuance and tone of a prompt and provides an unprecedented level of creative control — understanding prompts for all kinds of cinematic effects, like time lapses or aerial shots of a landscape.”
While Sora can also create minute-long videos, OpenAI focused its model on realism and a deep understanding of the physical world.
“Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.”
OpenAI posted this Sora video with the prompt: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous Sakura petals are flying through the wind along with snowflakes.”
Introducing Sora, our text-to-video model.
Sora can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. https://t.co/7j2JN27M3W
Prompt: “Beautiful, snowy… pic.twitter.com/ruTEWn87vf
— OpenAI (@OpenAI) February 15, 2024
Google DeepMind posted this video on its official blog with the prompt: “A lone cowboy rides his horse across an open plain at beautiful sunset, soft light, warm colors.”
✍️ Prompt: “A lone cowboy rides his horse across an open plain at beautiful sunset, soft light, warm colors.” pic.twitter.com/D8uKDZVWto
— Google DeepMind (@GoogleDeepMind) May 14, 2024
Even with a shorter prompt, Veo created an aesthetically pleasing video with a beautiful rendition of a sunset. The footage looks extremely realistic, with accurate proportions, and the video is simple and consistent with the prompt. Yet it lacks some of the creative flair that Sora adds to its videos.
The Sora prompt is far more detail-oriented, and the video captures all of that detail well. It features a swooping camera following a couple through the streets. However, the footage looks as though it was shot through a distorting action-camera lens, like a GoPro's, rather than capturing a natural walk through the streets.
In another example, OpenAI posted a Sora video with the prompt: “Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow-covered trees and dramatic snow-capped mountains in the distance, mid-afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.”
Prompt: “Several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance… pic.twitter.com/Um5CWI18nS
— OpenAI (@OpenAI) February 15, 2024
Google posted a video with the prompt: “Many spotted jellyfish pulsating under water. Their bodies are transparent and glowing in deep ocean.”
✍️ Prompt: “Many spotted jellyfish pulsating under water. Their bodies are transparent and glowing in deep ocean.” pic.twitter.com/y9SmNd8NK0
— Google DeepMind (@GoogleDeepMind) May 14, 2024
Once again, note the level of detail in the Sora prompt: every element of the video is specified so that the output matches exactly what the user envisions. The resulting video is well proportioned and extremely lifelike.
Google again uses just a single-line prompt yet produces a highly detailed video, from the light dancing on the distant surface of the water to the movement of the jellyfish. However, there are limits to its creative range, as the model appears to weigh coherence and consistency over aesthetic flair.
There are many such examples for both Sora and Veo. One important thing to note, however, is that Google has not yet demonstrated a video that convincingly renders human faces. Its demos feature silhouettes and animals, but no detailed humans.
Sora, on the other hand, has generated highly realistic humans with near-perfect facial detail. The faces it produces exhibit complex emotions and fine details like facial lines. Take a look here:
Prompt: “A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.” pic.twitter.com/0JzpwPUGPB
— OpenAI (@OpenAI) February 15, 2024
So, judging only by the previews, which one is better right now? We would say it depends on the specific needs of the project.
Sora shines when the priority is crafting compelling narratives, evoking emotional connections, and encouraging creative expression.
If maintaining high visual coherence, exerting precise control over the output, and seamlessly extending video duration are paramount, Veo emerges as the better choice.
Conclusion
Both models have their limitations: Sora relies on lengthy prompts and can produce odd compositions, while Veo's output lacks fine detail and shows limited creative expression. We can expect both to build on their strengths and address these weaknesses, and once the models are publicly released, perhaps a clear winner will emerge.