EMO, Alibaba’s new audio2video diffusion model, challenges OpenAI’s Sora by making Sora’s own AI lady belt out Dua Lipa.
This week, Alibaba’s Institute for Intelligent Computing introduced “EMO,” or Emote Portrait Alive, a state-of-the-art “audio2video” artificial intelligence (AI) model. The model animates static portrait photos, transforming them into lifelike videos. Through EMO, the characters in an image can now speak or sing in sync with uploaded audio, creating a surreal visual experience.
In the video, we see the AI lady from OpenAI’s Sora flawlessly lip-syncing to Dua Lipa’s “Don’t Start Now.” This up-close demonstration highlights the substantial progress EMO represents, surpassing Sora and venturing beyond mere AI-generated video.
Moreover, the fact that the reference AI video is just two weeks old highlights the rapid pace of advancements taking place within Alibaba’s research lab.
On top of this, the model’s ability to capture emotion from the uploaded audio is truly remarkable. It’s fascinating to see how it can interpret context solely from the audio file. Watch the video below to witness this firsthand.
Even though the reference image is AI-generated and doesn’t look realistic at all, the character in it can still sing David Tao’s “Melody” with a full range of emotion. This capability is described in the paper Alibaba published on arXiv. The researchers wrote:
“In this work, we tackle the challenge of enhancing the realism and expressiveness in talking head video generation by focusing on the dynamic and nuanced relationship between audio cues and facial movements. We identify the limitations of traditional techniques that often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles.”
So, how does it make all these images lip-sync realistically?
Even Ruben Hassid, an AI advocate, highlights the increasing realism of these video generations.
The image below describes the basic structure of how the new model works. Users upload an audio track and then select an image they want the audio to be synced to. The reference can be a photograph, a painting, a digital artwork, or an AI-generated image.
Despite the complexity of “audio2video” technology, it’s remarkable how effortlessly this model operates. Simply upload an image and an audio track, and voilà: a new video is created.
A user on Reddit even pointed out that this technology could automate TikTok content generation, emphasizing how easy the model is to use.
Given the model’s simplicity, is there a catch?
Let’s hum to the rhythm!
- What is “audio2video” technology?
- So, What? Instantly Preserve and Restore Images
- Now, What’s in it for u⁺
What is “audio2video” technology?
“Audio2video” technology refers to the process of generating video from audio signals. This can be achieved through various methods, such as using machine learning models to map audio features (rhythm, tone, phonemes) to visual features (lip movements, facial expressions).
This technology is like a digital puppeteer: you hand it a portrait and an audio track, and it moves the face in the portrait so that every word and note lines up with the mouth and expressions on screen.
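For a rough mental model of that pipeline, here is a minimal sketch in Python. The audio-feature step uses the real librosa library; `PortraitAnimator`, its checkpoint name, and `save_video` are hypothetical stand-ins for illustration, not EMO’s published interface.

```python
# Minimal audio2video pipeline sketch (illustrative, not EMO's actual API).
import librosa
import numpy as np
from PIL import Image

def extract_audio_features(audio_path: str, sr: int = 16000) -> np.ndarray:
    """Load audio and compute a mel-spectrogram the animator can follow frame by frame."""
    waveform, sr = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=80)
    return librosa.power_to_db(mel)  # shape: (n_mels, time_frames)

# One reference image + one audio track in, a talking-head video out.
portrait = Image.open("portrait.png")          # photo, painting, or AI-generated image
features = extract_audio_features("song.mp3")  # lips and expressions will follow these frames

# Hypothetical model calls, shown only to illustrate the shape of the workflow:
# animator = PortraitAnimator.load("emo-like-checkpoint")
# frames = animator.animate(portrait, features, fps=25)
# save_video(frames, "talking_portrait.mp4")
```

The key idea is that each window of audio features steers the next video frame, which is what keeps the lips and expressions locked to the track.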
As “audio2video” technology emerges, it’s important to remember its predecessor, “text2video.”
A notable pioneer in this field is ModelScope, a creation of DAMO Academy, Alibaba’s research arm. The technology gained popularity when the model was used to generate the now-iconic “Will Smith Eating Spaghetti” meme.
While EMO animates a reference image to match an audio track, ModelScope generates videos from text prompts. With ModelScope and now EMO, you may be curious about Alibaba’s sudden emphasis on video-generating models.
To add to this conversation, a user on X pointed out how impressive it is to see this take on video generation come from a non-Western company. The comment highlights the prevailing dominance of Western companies in the AI industry. Perhaps this is why Alibaba is joining the AI race: to bring non-Western representation among the tech titans.
For an intriguing twist to the discussion, an Alibaba Cloud blog published on Jan. 26, 2024, addresses this directly. The blog highlights the Chinese company’s commitment to providing AI solutions to its existing customers and potentially expanding into a wider market. This illustrates that the Chinese tech firm not only values diversity in the AI race but also actively strives to enhance its AI-focused services.
With this mission statement, it is evident that Alibaba aims to play a pivotal role in the age of open-source models. Its objective is for clients not merely to depend on ready-made solutions, but to empower their own businesses through these AI models.
Nevertheless, if you’re enthusiastic about this technology, you may be interested in exploring Pika’s latest animated lip-sync feature.
This example showcases the remarkable lip-sync abilities of Pika’s AI-generated animated videos. In addition, the production quality is comparable to that of top animation studios such as Pixar, DreamWorks, and LAIKA.
With these rapid advancements, the potential of this technology in the upcoming months (or perhaps even weeks) is truly unpredictable.
Regardless, a.i. + u will keep you in the know.
Instantly Preserve and Restore Images
At present, one of the most compelling applications I envision for this AI model is the restoration of images and videos.
Unlike traditional image restoration, which depends on manual, time-intensive color-matching techniques, AI-based restoration leverages deep learning. By learning from extensive datasets, AI models can autonomously restore images with exceptional quality and precision.
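To make that concrete, here is a minimal sketch of one common restoration approach: inpainting the damaged regions of an old scan with the open-source diffusers library and a Stable Diffusion inpainting model. The checkpoint name, file names, and prompt are assumptions for illustration, and this is one technique among many, not EMO’s method.

```python
# Sketch: repairing scratches on a scanned photo with diffusion-based inpainting.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # assumed checkpoint; any inpainting model works
    torch_dtype=torch.float16,
).to("cuda")

old_photo = Image.open("family_portrait_1952.jpg").convert("RGB")  # damaged scan
damage_mask = Image.open("scratch_mask.png").convert("L")          # white pixels = areas to repair

restored = pipe(
    prompt="a clean, undamaged vintage portrait photograph",
    image=old_photo,
    mask_image=damage_mask,
).images[0]

restored.save("family_portrait_restored.jpg")
```

Because the model fills in the masked areas from what it has learned about photographs in general, no frame-by-frame manual retouching is needed.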
Just take a look at this post on r/StableDiffusion, a Reddit forum dedicated to AI-generated images and videos.
This is why AI models like EMO and ModelScope will change image and video restoration.
With these technological advancements, we are now able to restore films, historical videos, and personal mementos without the painstaking task of going through each video frame by frame. Consider the wealth of new knowledge that will be unearthed as a result of these restoration efforts.
This not only impacts the world at large but also enriches our own experiences.
What’s in it for u⁺
Currently, some of us are still hesitant to embrace AI as a new form of expression, and I empathize with that perspective.
It’s true that numerous complexities are associated with generative AI, ranging from copyright and trademark issues to deepfakes. Nevertheless, our current progress marks a significant beginning. The ongoing research demonstrates that the technology can achieve feats that seemed impossible a decade ago.
As a creative individual, I vividly recall the resistance that digital art faced when it was first introduced to the public. I distinctly remember how traditional artists would discredit creatives using tools like Adobe Photoshop, Procreate, or Canva, labeling digital art as “not real art.”
This is the same pattern happening now.
Digital art hasn’t supplanted traditional art, just as animated films haven’t eclipsed live-action features. Instead, digital art and animation have evolved into distinct mediums with their own principles and disciplines.
In the next two to three years, we may witness a significant shift. AI art will carve out its own niche, ushering in a wave of AI artists: a fresh breed of creators harnessing the full power of natural language processing (NLP). After all, the capacity to effectively communicate with a large language model to get the output you envision is an impressive skill in itself.
For instance, take a look at this “Terminator 2” remake titled “Our T2 Remake,” a film made entirely with generative AI. The parody remakes the 1991 classic through the collaboration of 50 AI artists.
Some may argue that this attempt is disrespectful to the original, but I think it’s an homage to the classics through an excellent use of modern technology.
With this execution, imagine all of the possibilities. Imagine the ability to view movies with alternate endings or create narratives of our own.
Through “audio2video” technology, all of this could be possible within a few years.
Today’s Additions!
If you’re interested in this story, you may want to add these to your media diet:
Lip-sync to the beat of your drum.
Do you want to stay current with the latest AI news?
At a.i. + u, we deliver fresh, engaging, and digestible AI updates.
Stay tuned for more exciting developments!
Let’s see what stories we can bring to life next.
See you next addition!