VASA-1: Microsoft's AI makes portraits speak ultra-realistically!

Microsoft unveils VASA-1, an artificial intelligence that animates photos and makes them speak in an ultra-realistic way. The result is simply stunning! The challenge now is to prevent abuse.

Microsoft is betting big on artificial intelligence, to the point of investing tens of billions of dollars in it. The company is integrating it into all its services, whether the Microsoft 365 office suite, the Edge browser, the Bing search engine, or its Windows tools. Thanks in part to its partnership with OpenAI, it is developing impressive technologies such as its Copilot assistant, its image generator, and VALL-E, the AI that imitates human voices. This time, on its blog, the Redmond firm unveils VASA-1, an artificial intelligence capable of animating photos of faces and making them speak in an ultra-realistic way. All it needs is a single portrait photo and an audio clip, and it produces a video with precise lip synchronization, expressive facial animation, and natural head movements. The result is as incredible as it is disturbing...

VASA-1: impressively realistic results

Microsoft researchers achieved this feat by combining several complex techniques built on deep learning. VASA-1 can generate video at a resolution of 512 x 512 pixels and up to 40 frames per second. It bears repeating: the result is simply breathtaking. You get the impression of watching real people talk, with all the nuances and subtleties of facial expression. Lips move in rhythm with the words, eyes blink and look around naturally (though the gaze is sometimes a little blank), eyebrows rise and furrow... What's more, the AI can animate illustrations, handle audio in different languages, and even sing. The Mona Lisa can be seen trying her hand at rap, and it is well worth a look. A few details do give the game away. Expressions can seem a little exaggerated, and the frequent head movements can look somewhat artificial. The AI also handles only the upper body and does not account for non-rigid elements such as hair or clothing. But apart from that, the result is impressive!

In the future, VASA-1 could be very useful for anything that requires realistic talking avatars, for example in video games, educational tools, or therapy. But the result is so realistic that there are legitimate concerns about the deepfakes such technology could generate. The Microsoft teams are well aware of this, and admit that VASA-1 "could be misused to impersonate human beings". As a result, the researchers have "no intention of releasing an online demo, API, product, additional implementation details or any related offerings until [they] are certain that the technology will be used responsibly and in accordance with appropriate regulations". That is just as well, because the fake audio of Emma Watson reciting Mein Kampf is still fresh in our minds...