The Mona Lisa who raps? New Microsoft AI animates faces from photos

I Ryu/Visual China Group/Getty Images

A Microsoft sign is seen at the company’s headquarters on March 19, 2023 in Seattle, Washington.


New York
CNN

The Mona Lisa can now do more than smile, thanks to new artificial intelligence technology from Microsoft.

Last week, Microsoft researchers detailed a new AI model they developed that can take a still image of a face and an audio clip of someone speaking and automatically create a realistic-looking video of that person speaking. The videos – which can be made from photorealistic faces, as well as cartoons or works of art – are complete with immersive lip sync and natural facial and head movements.

In one demo video, researchers showed how they animated the Mona Lisa to recite a comedic rap by actor Anne Hathaway.

The results of the AI model, called VASA-1, are both entertaining and a little jarring in their realism. Microsoft said the technology could be used for education or “improving accessibility for people with communication difficulties,” or possibly to create virtual companions for people. But it’s also easy to see how the tool could be misused to impersonate real people.

It’s a concern that goes beyond Microsoft: As more tools emerge to create compelling AI-generated images, videos, and audio, experts worry that their misuse could lead to new forms of misinformation. Some also worry that the technology could further disrupt creative industries, from film to advertising.

For now, Microsoft said it does not plan to immediately release the VASA-1 model to the public. The move is similar to how Microsoft partner OpenAI is handling concerns surrounding its AI-generated video tool Sora: OpenAI teased Sora in February, but so far has only made it available to some professional users and cybersecurity professors for testing purposes.

“We oppose any behavior that creates deceptive or harmful content from real people,” Microsoft researchers said in a blog post. But, they added, the company has no plans to publicly release the product “until we are confident the technology will be used responsibly and in accordance with appropriate regulations.”

Microsoft’s new AI model was trained on countless videos of people’s faces as they speak, and is designed to recognize natural facial and head movements, including “lip movements, (non-lip) expression, eye gaze and blinking, among others,” researchers said. The result is a more lifelike video when VASA-1 animates a photo.

For example, in a demo video with a clip of someone sounding excited, apparently while playing video games, the speaking face has furrowed eyebrows and pursed lips.

The AI tool can also be controlled to produce a video in which the subject looks in a certain direction or expresses a specific emotion.

If you look closely, there are still signs that the videos are machine-generated, such as irregular blinking and exaggerated eyebrow movements. But Microsoft said it believes its model “significantly outperforms” other similar tools and “paves the way for real-time interactions with lifelike avatars that mimic human conversational behavior.”