Chinese AI company Shengshu Technology and Tsinghua University recently unveiled Vidu, a text-to-video model capable of generating high-definition 16-second clips at 1080p resolution in a single click. In the announcement made at the 2024 Zhongguancun Forum in Beijing, they claimed that Vidu is a strong competitor to OpenAI’s Sora.
Highlights:
- Vidu is a Chinese text-to-video model that can generate 16-second 1080p clips in a single click.
- Vidu is positioned as a strong competitor to OpenAI’s groundbreaking text-to-video model Sora.
- It showcases complex physics, realistic visuals, and cultural adaptability but still trails OpenAI’s Sora in overall fidelity.
What is Vidu?
Vidu, a text-to-video AI model developed by Shengshu Technology and Tsinghua University in China, is capable of generating 16-second video clips at 1080p resolution. It is based on a self-developed Universal Vision Transformer (U-ViT) architecture, which the company claims allows it to simulate the real physical world with multi-camera view generation and complex scenes adhering to real-world physics, such as realistic lighting, shadows, and detailed facial expressions.
🚨 China just released SORA’s rival “Vidu”
This is China's first long duration, high consistency, and high dynamics video model
It can create videos upto 16s with 1080P in single click.
It excels at simulating the real physical world and also showcases a vivid imagination,… pic.twitter.com/6ThjAxrQs2
— Sambhav Gupta (@sambhavgupta6) April 27, 2024
Here is a quote from the Vidu official press release:
“Since the release of Sora, the battle for ‘domestic Sora’ has begun. But when the industry focuses on the ‘long’ feature, they all ignore that behind Sora is actually the improvement of comprehensive effects, such as consistency, realism, aesthetics, etc. in long time series. From the perspective of comprehensive effects, ‘Vidu’ is the first and only video model to fully benchmark against Sora at the effect level, not only domestically, but also globally. It is also the first video model to achieve a breakthrough after Sora.”
The emergence of Vidu signals China’s ambition to catch up with, and potentially surpass, leading US companies like OpenAI in generative AI. Doing so will require a significant increase in performance, though Vidu’s rapid progress suggests the gap could narrow. Interestingly, the U-ViT architecture underpinning Vidu was first proposed by the team behind the model in September 2022, predating the diffusion transformer (DiT) architecture used by Sora.
Zhu Jun, vice dean of the Institute for Artificial Intelligence at Tsinghua University and chief scientist of ShengShu-AI, said the following about Sora at the forum:
“After the release of Sora, we found that it closely aligned with our technical roadmap, which further motivated us to advance our research with determination.”
Features of Vidu
During a recent live demonstration, Vidu generated scenes with intricate detail that follow real-world physics, including accurate light and shadow effects and subtle facial expressions. Beyond visual realism, it can also generate complex dynamic shots with camera movement rather than only fixed ones.
[Embedded post: Angry Tom (@AngryTomtweets), April 27, 2024]
Moreover, as a homegrown Chinese model, Vidu is said to have a deep understanding of Chinese cultural elements. This enables it to generate videos featuring distinctive subjects such as pandas and loongs (Chinese dragons), which its developers present as evidence of the model’s cultural adaptability.
[Embedded post: Angry Tom (@AngryTomtweets), April 27, 2024]
More examples were shared in posts by Angry Tom (@AngryTomtweets) dated April 27, 2024.
Comparison with OpenAI’s Sora
While Vidu is a notable achievement and reflects China’s rapid progress in AI research, it currently falls short of OpenAI’s Sora. Sora, a text-to-video model capable of generating continuous videos of up to one minute in length, sets a benchmark for visual fidelity and realism that Vidu has yet to match.
However, Vidu’s main strength is its temporal consistency, an area with room for further refinement as research and development continue. The developers at Shengshu Technology point to Vidu’s “exceptional consistency” within generated scenes, where individual frames build logically upon one another.
One plausible explanation for the current gap between Vidu and Sora is that access to cutting-edge GPU resources is more limited in China than it is for a well-funded company like OpenAI. Even so, Vidu shows that Chinese developers intend to compete directly with leading US companies in generative AI.
While Vidu may currently lag behind Sora in overall visual fidelity, it has clear room to grow. As China continues to invest in AI research and development, further improvements to Vidu are likely, and the line between real and generated footage may become increasingly blurred.
There are also doubts regarding Vidu’s claimed ability to generate video clips of up to 16 seconds in length. While the developers at Shengshu Technology assert that Vidu can produce 1080p video clips spanning 16 seconds, the demonstrations and samples released thus far have only showcased clips ranging from 3 to 5 seconds in duration.
Let's be honest. Vidu isn't that impressive.
Supposedly, text-to-video that can generate up to 16 seconds at 1080p.
Clips in this demo are barely 3 seconds. 🤷‍♂️ pic.twitter.com/TKnjAJor63
— Min Choi (@minchoi) April 27, 2024
How to Access Vidu?
Vidu is not yet open for direct public use. However, users can fill out a form to apply for access to the text-to-video model.
Here is how you can apply for access:
- Click on the following link: https://www.shengshu-ai.com/home
- If you do not read Chinese, you can use Google Translate to translate the page into your preferred language.
- Scroll to the video generation section.
- Click on the ‘Apply for Use’ button. You will be directed to a form. Fill it out and submit it to apply for access.
Conclusion
Vidu, a Chinese text-to-video AI model, shows notable progress in generative AI and is positioned as a competitor to OpenAI’s Sora. While it demonstrates strengths in realism and cultural adaptability, it still needs to improve its fidelity before it can challenge industry-leading models like Sora.