From imitating human text, it only took AI a few years to imitate human expressions. Is this the year that AI will become more human than humans?
Do you remember when ChatGPT was introduced to the public?
I can recall how the public responded with, “I don’t like this, it’s too robotic.”
And so, in the year and a half since that public release, tech titans like OpenAI, Google, and Microsoft have kept strengthening the backbone of their research. Their goal: to shed the “too robotic” quality that made the public reluctant to adopt AI.
Now, with the rapid advances at play, these AI models have become too humanlike.
So, the public’s response changed.
From, “…it’s too robotic,” to “I don’t like this, it’s too human.”
We’re only four months into this year, and its AI developments have already been the most interesting so far. Amid all this, however, humans have not changed.
The responses still start with, “I don’t like this…”
And then I realized: it’s not the technology or the research that troubles us. It’s the novelty of the idea itself.
True enough, we have evolved to protect our survival. Our fight-or-flight response is deeply embedded in our genome, and it also shapes how we react to these humanlike AI models. That reaction has a name: the uncanny valley, the unnerving feeling we humans experience when we encounter an object or lifeform that closely resembles us.
This term was first introduced in 1970 by Japanese roboticist Masahiro Mori in an essay for the journal Energy.
In the essay, Mori proposed that the more humanlike a robot appears, the more affinity we feel towards it…but only up to a certain point. Once the resemblance comes close to the real thing (us) without quite matching it, that affinity abruptly drops into a valley of eeriness and revulsion.
More than fifty years after its publication, this idea, this fear and revulsion, still applies today.
With the release of research like Microsoft’s VASA-1, Google’s VLOGGER, and Alibaba’s EMO, we humans are faced with the same predicament.
Where exactly does this “certain point” lie on the spectrum between robot and human?
In today’s addition, we’ll dissect Microsoft’s newest research, VASA-1, an AI model that can turn a single static photo and an audio clip into a video.
We’ll look into the science behind this technology, similar existing research, and explore its real-life applications.
Let’s see how it all adds up!
✶ The top news explained!
Today’s Equation
Recently, Microsoft published their research titled “VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time.”
In the paper’s abstract, the authors wrote: “We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip… VASA-1 is capable of not only producing lip movements that are exquisitely synchronized with the audio but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness.”
The technology takes a single photo and an audio clip as input. It generates lip, eye, and head movements from the audio and then applies those movements to the image.
The paper highlights that, unlike other similar frameworks, VASA-1 goes beyond high-accuracy lip syncs. The authors stated that: “The creation of expressive facial dynamics and the subtle nuances of lifelike facial behavior remain largely neglected. This results in generated faces that seem rigid and unconvincing. Additionally, natural head movements also play a vital role in enhancing the perception of realism.”
And so, to achieve this desired realism, VASA-1 was trained using VoxCeleb2, a large-scale audio-visual dataset of human speech.
According to the VoxCeleb website, the dataset comprises 7,000+ speakers and 2,000+ hours of video.
That is minuscule next to the 1M+ hours of YouTube video rumored to be the training data for Sora, OpenAI’s text-to-video model.
Nonetheless, it’s not the dataset that makes this framework superior to its predecessors. It’s how they trained the model.
This is how the authors described their framework in their methodology: “Instead of generating video frames directly, we generate holistic facial dynamics and head motion in the latent space conditioned on audio and other signals.”
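This means that the model does not predict the video’s pixels directly. It first generates, from the audio, a compact representation of the whole face’s movement and the head’s motion, and only then renders the video frames from that representation and the source photo.
In my view, that’s a big part of why the demos look more convincing than those of VLOGGER and EMO.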
In the GIF above, you can see how coherent the facial expressions are: the movement of the eyebrows and the forehead, the blinking of the eyes, the crow’s feet, and, of course, the lips and the smile lines.
Although VASA-1 was trained only on real-life videos, the researchers found that the model also works on digital art, AI art, illustrations, and even paintings.
For now, the model’s limitations include rendering the upper body, hair movement, and, as usual with AI art, teeth.
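For the curious, the “motion first, pixels second” idea from the methodology quote can be pictured with a tiny Python toy. This is only a sketch under loud assumptions: Microsoft has not released VASA-1’s code, so every name here (`MotionLatent`, `audio_to_motion`, `render_frame`, `talking_face`) is hypothetical, the hand-written rules stand in for learned neural networks, and a “frame” is just a line of text.

```python
import math
from dataclasses import dataclass


@dataclass
class MotionLatent:
    """Hypothetical per-frame motion code: facial dynamics plus head pose."""
    mouth_open: float
    brow_raise: float
    head_yaw: float


def audio_to_motion(audio_energy: list[float]) -> list[MotionLatent]:
    """Stage 1 (toy stand-in): turn per-frame audio loudness into motion codes.

    A real system would use a learned generator; this is a hand-written rule.
    """
    latents = []
    for i, energy in enumerate(audio_energy):
        latents.append(
            MotionLatent(
                mouth_open=min(1.0, max(0.0, energy)),  # clamp to [0, 1]
                brow_raise=0.3 * energy,
                head_yaw=0.1 * math.sin(i / 5),  # gentle sway over time
            )
        )
    return latents


def render_frame(appearance: str, motion: MotionLatent) -> str:
    """Stage 2 (toy stand-in): combine appearance and motion into one 'frame'."""
    return (
        f"{appearance} | mouth={motion.mouth_open:.2f} "
        f"brow={motion.brow_raise:.2f} yaw={motion.head_yaw:.2f}"
    )


def talking_face(photo: str, audio_energy: list[float]) -> list[str]:
    """Encode the photo once, then render one frame per motion code."""
    appearance = f"face({photo})"  # placeholder for an appearance encoder
    return [render_frame(appearance, m) for m in audio_to_motion(audio_energy)]


if __name__ == "__main__":
    for frame in talking_face("portrait.png", [0.0, 0.5, 2.0]):
        print(frame)
```

The point of the split is that the photo is processed once, while the audio drives a small stream of motion codes, so the renderer never has to guess the movement and the pixels at the same time.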
While this technology has numerous use cases, from healthcare to customer service, Microsoft has no immediate plans to commercialize it. In their blog, Microsoft wrote: “Given such context, we have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.”
For now, if you are interested in testing out a product with a similar framework, you can check out HeyGen and Synthesia.
These tools have been around for some time now. But unlike VASA-1, VLOGGER, or EMO, which are based on images, these tools have premade avatars that you can play around with.
It’s actually pretty simple: you choose or create an avatar, record or choose a voice, and write your script or use one of their templates. Easy as that.
Currently, the top use cases for this type of technology are customer service and educational, training, and instructional videos.
The common denominator is that they all serve as resources for teaching people routine tasks.
Say, for example, you run a SaaS company. Instead of hiring a team to produce a walkthrough of your product, you can simply write a script, record your screen, and let these avatars do the talking.
With this process, you can scale fast.
According to a case study published by Synthesia, Zoom saved 90% of its product-training time after it started using the tool.
90%! That’s a lot.
Imagine all the other productive things you could do, the new tools you could develop, or the time you could take to rest! (Emphasis on the last one.)
What’s in it for u⁺
These are three potential use cases for VASA-1 (if it is ever released publicly):
Virtual Assistants & Customer Service Representatives
Companies could use VASA-1 to create lifelike avatars of support staff, enabling more engaging and personalized customer interactions without requiring live agents.
Personalized Educational Content
VASA-1 could allow educators to quickly create avatar-based lessons and tutorials using their own likeness and voice, making online learning more relatable and effective for students.
Communication for Individuals with Disabilities
This technology could create expressive avatars for people who have difficulty speaking or are unable to appear on camera, empowering them to communicate more naturally in virtual settings.
Top news of the day!
Daily Additions
Jensen Huang Personally Delivers First NVIDIA DGX H200 to OpenAI
NVIDIA CEO Jensen Huang personally delivered the first NVIDIA DGX H200 to OpenAI, posing with OpenAI’s president Greg Brockman and CEO Sam Altman. The upgraded H200 GPU has 1.4 times more memory bandwidth and 1.8 times more memory capacity compared to its predecessor, the H100, enhancing its capability to handle demanding generative AI tasks.
Are Singaporeans being seduced by AI influencers?
AI influencers are becoming increasingly popular on platforms like Instagram and TikTok. Singapore’s CapitaLand, a real estate group, created its own AI influencer, Rae, in 2022 to enhance online customer engagement, and she has since fronted campaigns with fashion brands like Gucci and Moschino.
‘To the Future’: Saudi Arabia Spends Big to Become an A.I. Superpower
Saudi Arabia is investing heavily in its tech industry to complement its oil dominance, with over $10 billion in deals reportedly sealed during the recent Leap conference in Riyadh. Major tech executives from companies like Amazon, Google, and IBM attended the event, with Amazon’s cloud computing division CEO announcing a $5.3 billion investment in Saudi Arabia for data centers and AI technology.
Reid Hoffman conducts a sit-down interview with his AI twin
Reid Hoffman, LinkedIn co-founder and AI pioneer, sat down for an unusual interview with an AI-generated digital twin of himself to discuss topics like regulation, ethics, jobs, and the benefits of using AI to enhance human connections.
AI predicts a person’s political stance just by reading their face
The researchers behind the study suggest that facial morphology, such as smaller faces for liberals and larger jaws for conservatives, may be correlated with political leanings due to self-fulfilling prophecy effects, and they warn that this technology could be used for targeted political messaging and biometric surveillance.
In partnership with Salina
Superb AI tools to add to your life!
Top-Up Your Toolbox
Bange • Create resumes and cover letters in minutes.
Complexity • The world’s knowledge at your fingertips.
Salina • Translate content from 90+ languages with ease.
BreezeDoc • Seamlessly collect signatures from one or multiple signers.
Boom • Make meetings and presentations more engaging, productive, and fun.
Do you want to stay current with the latest AI news?
At ai + u, we deliver fresh, engaging, and digestible AI updates.
Stay tuned for more exciting developments!
Let’s see what stories we can bring to life next.
See you next addition!