Key Takeaways
AI-powered video generation has garnered significant mainstream attention since OpenAI unveiled Sora earlier this month. With the public captivated by the quality and realism of the new model’s outputs, other players in the space are under pressure to prove that they are still relevant.
In recent days, text-to-video pioneer Pika and Chinese technology giant Alibaba have both released new AI tools able to animate mouths in sync with audio recordings. Together, Pika Lip Sync and Alibaba’s EMO point to a burgeoning field of AI tools that are geared toward augmenting images and videos rather than creating them from scratch.
It used to be easy to identify AI-generated videos. But the latest diffusion models are so much more capable than their predecessors that their outputs can be nearly indistinguishable from real video footage.
However, Sora and other models like it don’t come with native audio integration.
Syncing AI videos with sound has traditionally been a challenge. And if the audio is meant to line up with lip movements, the results are notoriously clumsy. As the researchers who developed EMO observed, “traditional techniques […] often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles.”
With OpenAI now muscling its way into the AI video generator segment, Pika’s new lip-sync feature could help the platform retain customers in an increasingly competitive market. Meanwhile, Alibaba’s take on the concept, which follows the December launch of its Sora-style text-to-video platform, reflects the company’s rising generative AI ambitions.
Oriented more toward Pixar-style animation than toward Sora’s pursuit of photorealism, Pika is an ideal candidate for depicting speech in AI-generated videos.
Granted, the videos showcased so far fall short of the standards professional animation studios can deliver. Nonetheless, they are still an improvement on previous attempts at depicting speech in AI videos.
For its part, Alibaba has approached AI lip-syncing from a slightly different angle, launching EMO as a standalone product that uses AI to make still images talk.
To create EMO, short for Emote Portrait Alive, researchers trained their AI model on over 250 hours of footage and more than 150 million images. The training dataset encompassed multiple languages and included speeches, film and television clips, and footage of song performances.
Building on Stable Diffusion and drawing on the same diffusion-based approach that underpins Sora, they designed their model to generate clips from input audio rather than from text prompts.
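In concrete terms, the key difference from a text-to-video system lies in what the denoising network is conditioned on. The sketch below is a deliberately toy PyTorch example of that idea only; every class name, dimension, and the crude update rule are illustrative assumptions and do not reflect Alibaba’s actual implementation.

# Toy sketch (illustrative only, not EMO's code): an audio embedding replaces
# the text embedding as the conditioning signal fed to the denoiser.
import torch
import torch.nn as nn

class ToyAudioConditionedDenoiser(nn.Module):
    """Predicts the noise in a frame latent, conditioned on audio features."""
    def __init__(self, latent_dim=64, audio_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + audio_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, audio_embedding, timestep):
        # The audio conditioning is concatenated with the noisy latent and timestep.
        x = torch.cat([noisy_latent, audio_embedding, timestep], dim=-1)
        return self.net(x)

# Crude reverse-diffusion loop: start from noise and iteratively denoise,
# steering every step with the audio embedding instead of a text prompt.
denoiser = ToyAudioConditionedDenoiser()
audio_embedding = torch.randn(1, 32)   # stand-in for features extracted from a speech clip
frame_latent = torch.randn(1, 64)      # start from pure noise
for step in reversed(range(50)):
    t = torch.full((1, 1), step / 50.0)
    predicted_noise = denoiser(frame_latent, audio_embedding, t)
    frame_latent = frame_latent - 0.02 * predicted_noise  # simplified update rule

In a production system the denoiser is a large network operating on video latents and the update follows a proper diffusion sampler, but the swap of conditioning signal, audio in place of text, is the same in spirit.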
While AI video generators have captured the world’s attention, the latest AI lip-sync tools reflect the emergence of an auxiliary trend.
Instead of focusing on creating original scenes, Pika Lip Sync and Alibaba’s EMO are geared toward the augmentation of existing material.
Alongside models designed to increase the quality or fidelity of input videos, they are part of an emerging editorial toolkit that uses generative AI to enhance rather than replace traditional methods of video production.