Microsoft’s new VASA-1 AI framework generates super-realistic talking heads that can even sing songs

Microsoft Corp. has published a research paper that introduces a new kind of artificial intelligence framework that makes it possible to upload a still photo, add a voice sample and create a super-realistic talking head that looks and sounds like the real person.

The new framework, called VASA-1, takes a single portrait-style image and an audio file and merges them to create a short video of a talking head with realistic facial expressions, head movements and even the ability to sing songs in the uploaded voice.

Microsoft said VASA-1 is currently only a research project, so it’s not making it available for anyone else to use, but it posted a number of demonstration videos with dazzling realism.

Although Nvidia Corp. and Runway AI Inc. have both released similar technology, VASA-1 seems to be able to create much more realistic talking heads, with reduced mouth artifacts.

The company said the new framework is specifically designed for animating virtual characters, so all of the individuals in its examples are synthetic, generated using OpenAI’s DALL-E image-generating model. However, it clearly has the potential to go further, because if it’s possible to animate an AI-generated image, it should be just as easy to animate a photo of a real person.

In the demos, the talking heads look like real individuals who were filmed, with smooth, natural-looking movements. The lip-sync capabilities are especially impressive, and it’s very difficult to discern any unnatural-looking movement.

Equally impressive is that VASA-1 doesn’t seem to require a traditional, face-forward, passport- or portrait-style image to work. The examples include shots of heads facing in slightly different directions. The model also offers a high level of control, accepting inputs such as eye gaze direction, head distance and even emotional expression, which adds to the realism.

Big potential and big risks

In terms of practical applications, one of the most obvious use cases would be video games. VASA-1 could enable developers to create more realistic AI-generated characters with extremely natural lip-syncing and facial expressions, boosting immersion. The technology could also be used to create avatars in social media videos, and perhaps even go further and enable more realistic AI-generated movies or music videos in which it genuinely appears as if the actor, actress or singer is really talking or singing.

Besides its ability to lip-sync talking heads perfectly to an uploaded song, VASA-1 can also handle nonhuman images: one demonstration shows the Mona Lisa rapping the words of “Paparazzi.”

That said, just as there is potential for creativity, there is undoubtedly potential for misuse. VASA-1 would certainly make life much easier for anyone invested in creating deepfake videos. For instance, someone could upload a headshot of Donald Trump, followed by a short audio clip of his voice, then create a realistic video of him saying whatever they want him to say.

The risk of misuse explains why Microsoft is being so guarded about the project. “Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications,” Microsoft’s researchers said. “It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans.”

As such, the company said there are no plans to release an online demo, product or additional implementation details at present, adding that it will only consider doing so when it’s certain that the technology will be used responsibly.