{"id":2093,"date":"2024-03-01T11:52:15","date_gmt":"2024-03-01T11:52:15","guid":{"rendered":"https:\/\/favtutor.com\/articles\/?p=2093"},"modified":"2024-03-01T11:52:16","modified_gmt":"2024-03-01T11:52:16","slug":"emo-ai-alibaba-features-model-training","status":"publish","type":"post","link":"https:\/\/favtutor.com\/articles\/emo-ai-alibaba-features-model-training\/","title":{"rendered":"SORA Got New Competition from EMO for AI Video Generation"},"content":{"rendered":"\n<p>While we were still marveling at SORA&#8217;s magnificent capabilities, a new player has entered the field of generative video AI: Chinese company Alibaba. Their EMO AI can bring your photos to life; let&#8217;s find out how. <\/p>\n\n\n\n<p><strong>Highlights:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alibaba researchers have developed EMO, a generative AI tool that brings portraits and images to life.<\/li>\n\n\n\n<li>Comes with all-new audio-to-video technology in the form of a Diffusion Model.<\/li>\n\n\n\n<li>Has limitations but still outperforms various traditional models on several benchmark scores.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>What is EMO AI by Alibaba?<\/strong><\/h2>\n\n\n\n<p><strong>Alibaba announced a new AI tool named EMO that can create realistic-looking videos from existing images and portraits.<\/strong> <strong>The tool can also integrate audio inputs with the video, so the subject can speak and sing.<\/strong><\/p>\n\n\n\n<p>EMO is short for Emote Portrait Alive. It was developed by researchers at Alibaba\u2019s Institute for Intelligent Computing. The tool narrows the gap between realism and artistry by bringing AI together with video generation. 
It can animate input images by synchronizing lip and eye movements.<\/p>\n\n\n\n<p>Here is how they described the tool in their <a href=\"https:\/\/arxiv.org\/pdf\/2402.17485.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">research paper<\/a>:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;We proposed EMO, an expressive audio-driven portrait-video generation framework. Input a single reference image and the vocal audio, e.g. talking and singing, our method can generate vocal avatar videos with expressive facial expressions, and various head poses, meanwhile, we can generate videos with any duration depending on the length of input audio&#8221;<\/p>\n<\/blockquote>\n\n\n\n<p>EMO AI comes with several impressive features. Here are the key functionalities you need to know:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>1) Still Portrait Images to Videos<\/strong><\/h3>\n\n\n\n<p>EMO AI can bring a single portrait picture to life, creating lifelike videos that give the impression the person in the picture is speaking or singing.\u00a0<\/p>\n\n\n\n<p>A reference image is used to create a video that replicates the image&#8217;s appearance. Blend shapes and head poses are used to drive the portrait&#8217;s facial expressions and head motions.<\/p>\n\n\n\n<p>These are then employed to generate a three-dimensional facial mesh, which acts as an intermediate representation to guide the generation of the final video frames.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>2) Audio to Video Conversion<\/strong><\/h3>\n\n\n\n<p>The breathtaking aspect of this tool is that its cutting-edge technology allows it to generate videos directly from audio cues. 
This is a direct departure from traditional models, which need a text- or image-based prompt to generate videos.<\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\" data-media-max-width=\"560\"><p lang=\"en\" dir=\"ltr\">Just a while after Sora by OpenAi, It&#39;s been a busy period for the AI space with announcements from Alibaba, Google, Ideogram and lightrick.<br><br>Here are the most important developments that happened:<br><br>1. Researchers from alibaba unveiled EMO: Emote Portrait Alive by Alibaba, an AI\u2026 <a href=\"https:\/\/t.co\/AZz5AdiuoH\" target=\"_blank\">pic.twitter.com\/AZz5AdiuoH<\/a><\/p>&mdash; Nova (@Novaprayer_) <a href=\"https:\/\/twitter.com\/Novaprayer_\/status\/1763522668035260572?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">March 1, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>This method produces extremely expressive and lifelike animations by ensuring smooth frame transitions and consistent identity retention.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3) Capturing Intricate Facial Expressions<\/strong><\/h3>\n\n\n\n<p><strong>EMO AI applies the ultimate stroke of realism by capturing intricate facial movements, including lip and eye movements. Its ability to synchronize lip movements with audio cues places it among the foremost Gen AI tools.&nbsp;<\/strong><\/p>\n\n\n\n<p>EMO AI captures the complex, dynamic interaction between audio cues and facial movements. Beyond static expressions, it accommodates a broad range of human emotions and unique facial styles.<\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\" data-media-max-width=\"560\"><p lang=\"en\" dir=\"ltr\">AI Videos disrupt HOLLYWOOD<br><br>This AI can make any image TALK, SING, even RAP<br><br>Here are 10 wild examples of EMO ( Sound on )<br><br>1. 
Leonardo DiCaprio rapping Eminem <a href=\"https:\/\/t.co\/NVyVEzsugo\" target=\"_blank\">pic.twitter.com\/NVyVEzsugo<\/a><\/p>&mdash; Poonam Soni (@CodeByPoonam) <a href=\"https:\/\/twitter.com\/CodeByPoonam\/status\/1763177706634756150?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">February 29, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>Overall, the tool can produce videos with realistic speech and singing in a variety of styles. It brings anything to life, be it beautiful music or a poignant conversation. Imagine transforming a static portrait into an animated figure that speaks or sings: EMO makes it possible!\u00a0<\/p>\n\n\n\n<p>It remains to be seen whether <a href=\"https:\/\/favtutor.com\/articles\/try-openai-sora-video-generator\/\">SORA&#8217;s AI Video Generation Tool<\/a> can do the same.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Looking Into EMO AI&#8217;s Build<\/strong><\/h2>\n\n\n\n<p><strong>EMO AI\u2019s architecture is based mainly on Diffusion Models, given their proven ability to produce high-quality images when trained on extensive image datasets. <\/strong>The researchers have gone beyond image generation and achieved realistic video generation by integrating audio with the Diffusion Model.<\/p>\n\n\n\n<p>Additionally, they adapted ReferenceNet&#8217;s methodology and developed a comparable module called FrameEncoding, which maintains the character&#8217;s identity throughout the video and guarantees that the character in the output video stays consistent with the input reference image.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Model Training<\/strong><\/h3>\n\n\n\n<p>Researchers assembled a varied audio-video dataset with over 250 hours of footage and 150 million images to train EMO. 
This dataset includes a variety of content categories, such as speeches, multilingual song performances, and film and television clips. This abundance of content ensures that EMO captures a broad spectrum of facial expressions and vocal styles, providing a strong foundation for training.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The Methodology<\/strong><\/h3>\n\n\n\n<p>The EMO framework has two primary phases: frame encoding and the diffusion process. During the frame encoding phase, ReferenceNet extracts features from the reference image and motion frames. The diffusion process involves a pretrained audio encoder, the integration of a face region mask, and denoising operations carried out by the backbone network.\u00a0<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"1024\" height=\"726\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training-1024x726.jpg\" alt=\"EMO AI model training\" class=\"wp-image-2094\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training-1024x726.jpg 1024w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training-300x213.jpg 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training-768x545.jpg 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training-120x86.jpg 120w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training-750x532.jpg 750w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training-1140x809.jpg 1140w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-model-training.jpg 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p>Identity is preserved and movements are modulated via attention mechanisms such as Reference-Attention and 
Audio-Attention. Temporal Modules control the temporal dimension and adjust motion velocity, making the video generation process fluid and expressive.<\/p>\n\n\n\n<p>By adding a FrameEncoding module to ReferenceNet, EMO improves on that methodology while staying faithful to the supplied reference image. This module ensures the character&#8217;s identity is maintained throughout the video, which enhances the end product&#8217;s realism.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Does EMO\u2019s Workflow Have Any Limitations?<\/strong><\/h2>\n\n\n\n<p><strong>One of the biggest limitations of EMO\u2019s model lies in the integration between the diffusion model and the audio cues. Noise in the audio can cause stability issues, which may appear as artifacts in the generated video.<\/strong><\/p>\n\n\n\n<p>To address these stability issues, Alibaba researchers introduced control mechanisms, namely a speed controller and a face region controller. These two controllers serve as hyperparameters, providing subtle control signals without sacrificing the expressiveness or diversity of the final videos.&nbsp;<\/p>\n\n\n\n<p>Traditional approaches, by contrast, rely on a base video, which can limit realism by locking head motions and producing only mouth movements. The restricted representational capacity of the 3D mesh, which limits the overall expressiveness and realism of the output videos, is a recurring problem with these technologies. 
<\/p>\n\n\n\n<p>Furthermore, the non-diffusion models on which both of these approaches are based severely restrict the quality of the results they produce.<\/p>\n\n\n\n<p>The research paper further states:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>&#8220;There are some limitations for our method. First, it is more time-consuming compared to methods that do not rely on diffusion models. Second, since we do not use any explicit control signals to control the character\u2019s motion, it may result in the inadvertent generation of other body parts, such as hands, leading to artifacts in the video. One potential solution to this issue is to employ control signals specifically for the body parts.&#8221;<\/p>\n<\/blockquote>\n\n\n\n<p>These are a few of the limitations the current model struggles with; however, it still offers many improvements over previous models and approaches.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Improvement Over Other Models<\/strong><\/h2>\n\n\n\n<p>While previous approaches mostly rely on 3D modeling or blend shapes to mimic facial movement, EMO takes a more direct approach: it translates audio waveforms straight into video frames, producing incredibly lifelike animations that capture each person&#8217;s characteristics and quirks.<\/p>\n\n\n\n<p><strong>Extensive experiments described in the research paper demonstrate that EMO greatly outperforms existing state-of-the-art systems in terms of identity retention, emotional expressiveness, and video quality. 
<\/strong>In a user study, EMO-generated videos were praised for being more emotive and natural-looking than those produced by competitors.<\/p>\n\n\n\n<p>Below is a table from EMO\u2019s <a href=\"https:\/\/arxiv.org\/pdf\/2402.17485.pdf\" target=\"_blank\" rel=\"noopener\">research paper<\/a>, showing that it outperforms other methods in individual frame quality, as indicated by better FID scores:<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img decoding=\"async\" width=\"1024\" height=\"358\" src=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-Benchmarks-and-Comparison-1024x358.jpg\" alt=\"EMO AI Benchmarks and Comparison\" class=\"wp-image-2095\" srcset=\"https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-Benchmarks-and-Comparison-1024x358.jpg 1024w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-Benchmarks-and-Comparison-300x105.jpg 300w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-Benchmarks-and-Comparison-768x268.jpg 768w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-Benchmarks-and-Comparison-750x262.jpg 750w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-Benchmarks-and-Comparison-1140x398.jpg 1140w, https:\/\/favtutor.com\/articles\/wp-content\/uploads\/2024\/03\/EMO-AI-Benchmarks-and-Comparison.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<\/div>\n\n\n<p><strong>EMO&#8217;s model performs exceptionally well at producing dynamic facial expressions, as demonstrated by E-FID, even though it did not receive the highest scores on the SyncNet metric.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>A Challenge to OpenAI\u2019s Sora?<\/strong><\/h2>\n\n\n\n<p>EMO\u2019s demo videos on <a href=\"https:\/\/humanaigc.github.io\/emote-portrait-alive\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">GitHub<\/a> include one featuring the AI lady from Sora. The lady, famous for walking around an AI-generated Tokyo after a rainstorm, can be seen singing Dua Lipa\u2019s \u201cDon\u2019t Start Now\u201d:<\/p>\n\n\n\n<div align=\"center\"><blockquote class=\"twitter-tweet\" data-media-max-width=\"560\"><p lang=\"en\" dir=\"ltr\">This is mind blowing.<br><br>This AI can make single image sing, talk, and rap from any audio file expressively! \ud83e\udd2f<br><br>Introducing EMO: Emote Portrait Alive by Alibaba.<br><br>10 wild examples: \ud83e\uddf5\ud83d\udc47<br><br>1. AI Lady from Sora singing Dua Lipa <a href=\"https:\/\/t.co\/CWFJF9vy1M\" target=\"_blank\">pic.twitter.com\/CWFJF9vy1M<\/a><\/p>&mdash; Min Choi (@minchoi) <a href=\"https:\/\/twitter.com\/minchoi\/status\/1762812204884074979?ref_src=twsrc%5Etfw\" target=\"_blank\" rel=\"noopener\">February 28, 2024<\/a><\/blockquote> <script async src=\"https:\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><\/div>\n\n\n\n<p>Although the video has left millions in shock by bringing the lady to life, it&#8217;s hard to draw any conclusions yet as to which tool is better. Sora and EMO have somewhat similar functions but entirely different underlying technologies. Neither tool is publicly available yet, so it&#8217;s hard to say whether EMO\u2019s audio-driven videos can outshine Sora\u2019s text-prompted video generation.<\/p>\n\n\n\n<p>We have yet to experience the quality of the content firsthand, so let&#8217;s wait and see. 
At the same time, we have seen Adobe doing wonders with AI in the audio industry with the <a href=\"https:\/\/favtutor.com\/articles\/adobe-music-generative-ai-control\/\">Music GenAI Control Project<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>Without a doubt, EMO AI has made a major mark in bringing artistry to life, and as developers, we can\u2019t wait to get our hands on the model and try it out for ourselves. It does have limitations, but its cutting-edge audio-to-video technology doesn\u2019t fail to impress. We will keep you updated on the latest information regarding EMO\u2019s release and enhancements!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Find out about the new features of EMO AI for portrait video generation, its model training, and how it can compete with SORA.<\/p>\n","protected":false},"author":15,"featured_media":2098,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"jnews-multi-image_gallery":[],"jnews_single_post":null,"jnews_primary_category":{"id":"","hide":""},"footnotes":""},"categories":[57],"tags":[89,59,60,62],"class_list":["post-2093","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","tag-emo","tag-generative-ai","tag-openai","tag-sora"],"_links":{"self":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/2093","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/comments?post=2093"}],"version-history":[{"count":2,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/2093\/revisions"}],"pred
ecessor-version":[{"id":2099,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/posts\/2093\/revisions\/2099"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media\/2098"}],"wp:attachment":[{"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/media?parent=2093"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/categories?post=2093"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/favtutor.com\/articles\/wp-json\/wp\/v2\/tags?post=2093"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}