VLOGGER AI: Google brings your still photos to life and lets you control them with your voice
From photos to videos! Learn all about VLOGGER, Google's new AI and video generator that transforms photos into voice-controllable avatars. Discover its advantages over other similar tools, as well as its limitations and possible uses.
Google researchers have been busy lately, releasing a flurry of new models and ideas. The latest is a tool that turns a still image into a controllable avatar, following in the footsteps of an AI capable of playing video games. This research technology could change the way we interact with avatars and multimedia content.
What is VLOGGER?
It is an AI model capable of creating an animated avatar from a still image, maintaining the realistic appearance of the person in the photo in each frame of the final video.
According to the research paper titled "VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis," the AI model can take as input a photo of a person and an audio clip of that person speaking. From there, the model generates a video that matches the audio, showing the person saying the words and making corresponding facial expressions, head movements, and hand gestures. While the videos are not perfect and may contain some visual flaws, they represent a significant leap in the ability to animate still images.
How does VLOGGER work?
Google recently introduced the VLOGGER AI model on a project page hosted on GitHub. The model takes a portrait photo and an audio clip as input. It makes the person in the photo "move" and show facial expressions, so that the resulting avatar appears to speak the audio content.
The architecture of the model and how it works
VLOGGER is based on the diffusion architecture that powers text-to-image, video, and even 3D generation tools such as Midjourney or Runway, but adds additional control mechanisms.
To generate the avatar, VLOGGER follows multiple steps. First, it takes an audio clip and an image as input and runs them through a 3D motion generation process. Next, a "temporal diffusion" model determines the timing and motion. Finally, the result is upscaled and converted into the final output.
In essence, VLOGGER uses a neural network to predict the movement of the face, body, pose, gaze, and expressions over time. It uses the still image as the first frame and the audio as a guide.
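The steps described above can be sketched as a three-stage pipeline. This is only a toy illustration of the data flow, not Google's implementation: every name here (`MotionFrame`, `predict_motion`, `render_frames`, `upscale`, `generate_video`) is hypothetical, and each stage is a simple stand-in for what would be a learned model in the real system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MotionFrame:
    """Hypothetical per-frame motion controls (not VLOGGER's real parameterization)."""
    head_yaw: float
    mouth_open: float


def predict_motion(audio: List[float], fps: int, sample_rate: int) -> List[MotionFrame]:
    """Stage 1 stand-in: turn audio into one motion frame per video frame.

    A real system would use a learned audio-to-motion model; here the mouth
    opening simply follows the mean absolute amplitude of each audio window.
    """
    samples_per_frame = sample_rate // fps
    frames = []
    for start in range(0, len(audio) - samples_per_frame + 1, samples_per_frame):
        window = audio[start:start + samples_per_frame]
        loudness = sum(abs(s) for s in window) / len(window)
        frames.append(MotionFrame(head_yaw=0.0, mouth_open=min(1.0, loudness)))
    return frames


def render_frames(reference_image: str, motion: List[MotionFrame]) -> List[str]:
    """Stage 2 stand-in: a temporal diffusion model would synthesize pixels.

    Here each "frame" is just a text description tied to the reference image.
    """
    return [f"{reference_image}|mouth={m.mouth_open:.2f}" for m in motion]


def upscale(frames: List[str], factor: int = 2) -> List[str]:
    """Stage 3 stand-in: tag each frame with the assumed upscaling factor."""
    return [f"{frame}|x{factor}" for frame in frames]


def generate_video(reference_image: str, audio: List[float],
                   fps: int = 25, sample_rate: int = 16000) -> List[str]:
    """Chain the three stages: audio -> motion -> frames -> upscaled output."""
    motion = predict_motion(audio, fps, sample_rate)
    return upscale(render_frames(reference_image, motion))


if __name__ == "__main__":
    one_second_of_audio = [0.5] * 16000  # constant-amplitude placeholder signal
    video = generate_video("photo.png", one_second_of_audio)
    print(len(video), "frames; first frame:", video[0])
```

The point of the sketch is the ordering: motion is predicted from audio first, and only then are frames generated from the single reference photo and upscaled.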
Training with a large multimedia dataset
A team led by Google researcher Enric Corona trained the model on a large multimedia dataset called MENTOR, which makes VLOGGER's functions possible. The dataset consists of 800,000 videos of different people talking, in which every part of the face and body is labeled at all times.
In which cases can VLOGGER be used?
Google researchers foresee several use cases for VLOGGER:
- Video translation: For example, VLOGGER could take an existing video in a given language and edit the lips and face to match the newly translated audio.
- Animated avatars: VLOGGER could create animated avatars for virtual assistants, chatbots, or virtual characters that look and move realistically in a game environment. Similar tools already exist, such as Synthesia, but this new model seems to greatly simplify the process.
- Low bandwidth video communication: A future version of the model could allow video calls from audio, animating an avatar with a still image. This could be especially useful for virtual reality environments on devices such as Meta Quest or Apple Vision Pro, working independently of the platform's avatar models.
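To get a feel for the low-bandwidth use case, the rough arithmetic below compares sending a full video stream with sending only audio plus a one-off still image. All three figures (`VIDEO_CALL_KBPS`, `AUDIO_ONLY_KBPS`, `STILL_IMAGE_MB`) are illustrative assumptions chosen for this example, not measurements from the VLOGGER paper.

```python
# Illustrative assumptions only; real bitrates vary widely by codec and quality.
VIDEO_CALL_KBPS = 1500.0   # assumed bitrate of a typical video call
AUDIO_ONLY_KBPS = 32.0     # assumed bitrate of a speech audio stream
STILL_IMAGE_MB = 0.5       # assumed one-off cost of sending the portrait photo


def megabytes_per_minute(bitrate_kbps: float) -> float:
    """Convert a bitrate in kilobits per second to megabytes per minute."""
    bits_per_minute = bitrate_kbps * 1000 * 60
    return bits_per_minute / 8 / 1_000_000


def call_cost_mb(minutes: float, bitrate_kbps: float, upfront_mb: float = 0.0) -> float:
    """Total data for a call: a fixed upfront cost plus the streamed data."""
    return upfront_mb + minutes * megabytes_per_minute(bitrate_kbps)


if __name__ == "__main__":
    minutes = 10
    video_mb = call_cost_mb(minutes, VIDEO_CALL_KBPS)
    avatar_mb = call_cost_mb(minutes, AUDIO_ONLY_KBPS, upfront_mb=STILL_IMAGE_MB)
    print(f"Video call:  {video_mb:.1f} MB")
    print(f"Audio+photo: {avatar_mb:.1f} MB")
```

Under these assumed numbers, the audio-driven call sends a small fraction of the data, because the avatar would be generated on the receiving side.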
What are the advantages of VLOGGER over other similar tools?
- Versatility: As mentioned above, the model can be used for a wide range of applications, from video translation to avatar creation for gaming, education, customer service, and more. This flexibility makes it adaptable to the specific needs of each user.
- Accessibility: VLOGGER has the potential to democratize access to the creation of realistic avatars, allowing even users with no previous experience in animation or design to create attractive and professional content.
- Efficiency: VLOGGER streamlines the avatar creation process, reducing the time and resources required compared to traditional methods. This makes it well suited to projects that require fast production of multimedia content.
- Compared to other tools: Similar tools already exist to some extent, such as Pika Labs' lip sync, HeyGen's video translation services, and Synthesia. However, VLOGGER seems to be a simpler option and requires less bandwidth.
What are the disadvantages and risks of VLOGGER?
- Imperfect fidelity: VLOGGER is an interesting development, but it is a research prototype and not a final product. Although it can generate realistic-looking movements, the final video may not always match how the person actually moves. It is still a diffusion model, and diffusion models are prone to unusual behaviors.
- Motion limitations: The development team recognizes that VLOGGER also has difficulties with particularly large movements or diverse environments. In addition, it can only handle relatively short videos.
- Restricted access: VLOGGER is in the research phase and is not yet available to the public.
- Impersonation: The tool could be used to create fake videos of real people, which could have serious consequences.
- Misinformation: AI's ability to generate realistic videos could facilitate the creation of misleading content or disinformation.
- Social engineering: Scammers could use VLOGGER to create convincing avatars posing as trusted individuals to manipulate people and obtain personal or financial information. It is important to be on the lookout for any unusual requests or suspicious behavior from people online, even if they appear to be acquaintances or friends.
The development of VLOGGER must be accompanied by a thorough reflection on its ethical and social implications. Measures are needed to ensure that this technology is used responsibly and does not pose a threat to people's security or privacy.