Voicebox: Everything about Meta's generative speech AI model

Meta recently unveiled its generative AI model, Voicebox, joining a race led by Google and OpenAI. It is a speech model that can create audio from your voice, someone else's voice, or a text prompt. Want to know when it is coming out and what it is for? Read on!

What is Voicebox?

Meta's Voicebox is an artificial intelligence model that can perform speech generation tasks, such as editing, sampling, and audio stylization, from a voice sample and text commands. Like other generative models, such as ChatGPT for text or DALL-E 2 for images, Voicebox allows you to work with audio in innovative ways.

One of its distinguishing features is that, unlike many other AI models, Voicebox does not need a large, task-specific dataset for each new voice or task, because it can solve tasks through in-context learning. The model is trained to predict a speech segment when given the surrounding speech and the transcript of that segment. Having learned to fill in speech from context, it can apply that skill to a range of speech generation tasks. With just a few seconds of audio, it could, for example, replicate an entire conversation in your own voice, or remove background noise by regenerating the affected segments from the rest of the original track.
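The fill-in-the-gap training setup described above can be illustrated with a toy sketch. This is not Meta's code: the function name `make_infilling_example`, the `mask_fraction` parameter, and the use of plain numbers as stand-ins for audio frames are all assumptions made for illustration. It only shows how one training example could be built: hide a contiguous span of the audio, keep the surrounding frames and the transcript as input, and use the hidden span as the prediction target.

```python
import random


def make_infilling_example(frames, transcript, mask_fraction=0.3, seed=0):
    """Build one toy infilling example (hypothetical helper, not Meta's code).

    frames: a list of stand-in audio frames (plain numbers here).
    transcript: the text spoken across the whole clip.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    rng = random.Random(seed)
    n = len(frames)
    # Length of the hidden span: at least one frame, never more than the clip.
    span = min(n, max(1, int(n * mask_fraction)))
    start = rng.randrange(0, n - span + 1)
    end = start + span
    # The model would see the surrounding frames; the hidden span becomes None.
    context = [None if start <= i < end else f for i, f in enumerate(frames)]
    return {
        "context": context,           # surrounding speech with a gap
        "transcript": transcript,     # text for the whole clip
        "mask": (start, end),         # where the gap is
        "target": frames[start:end],  # what the model should predict
    }


if __name__ == "__main__":
    example = make_infilling_example(list(range(10)), "hello world")
    print(example["context"])
    print(example["target"])
```

At generation time the same idea would apply in reverse: the "gap" is the audio you want to create, and the context is the short voice sample you provide.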

How does Voicebox work?

As you may already know, artificial intelligence models are trained on datasets (generally, the larger the better) from which they learn to perform certain tasks, and they can refine their results thanks to the feedback they receive. Voicebox was trained on more than 50,000 hours of audio in different languages. It builds on a method called Flow Matching and, because it learned to fill in speech from context, it does not need separate guided training for each task: it can adapt to the audio sample and text provided at the time. Meta reports that Voicebox outperforms the current English model VALL-E in intelligibility (a word error rate of 1.9% vs. VALL-E's 5.9%; lower is better) and in audio similarity (0.681 vs. VALL-E's 0.580; higher is better). In cross-lingual style transfer, Voicebox outperforms YourTTS, cutting the average word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481. Sounds great, right?
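To make the intelligibility figures concrete: word error rate (WER) is commonly computed as the minimum number of word substitutions, insertions, and deletions needed to turn the recognized text into the reference text, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not the evaluation pipeline used in Meta's research; the function names and the example sentences are made up for this article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One wrong word out of six in a made-up example.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
    # Figures quoted in the article: 5.9% (VALL-E) vs. 1.9% (Voicebox).
    print(relative_reduction(5.9, 1.9))
```

By this measure, going from a 5.9% to a 1.9% word error rate is a relative reduction of roughly 68%.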

What is Voicebox for?

Voicebox opens up a world of possibilities in artificial intelligence, as it can produce high-quality audio snippets and also edit pre-recorded audio (imagine cleaning up audio from annoying background noises) while preserving the original content and style of the audio. To top it off, this model is multilingual and can currently generate voice audio in six languages.

These are just some of the things you can do with this AI:

In-context text-to-speech synthesis: the genius of Voicebox is that it goes way beyond tools that simply generate audio from text (think, for example, of translation and language learning apps). You can add an audio sample of just a couple of seconds, and Voicebox will use that speech style to generate audio from the text you type.
Speech editing and noise reduction: Voicebox can recreate a portion of speech interrupted by noise or replace mispronounced words without the speaker having to re-record anything.
Style transfer between languages: you can insert a speech sample and text in English, French, German, Spanish, Polish or Portuguese, and Voicebox will be able to read the text in any of those languages using the same speech style. This could be used to communicate in other languages while retaining your own style.
Diverse speech sampling: having learned from diverse datasets, Voicebox can generate a voice that is more faithful to how people express themselves in the real world.

Also, people with speech problems could use it to communicate more easily. A translator built on it would let us communicate in other languages while preserving our own style. Another interesting use would be in the metaverse, where computer-operated characters could be given a voice. In the media, it could help edit speeches and other audio material, not to mention all the possibilities it would open up for music production (especially for emerging artists).

How can you use Meta's Voicebox?

At the moment, this generative speech model is not available to the general public. As Meta states: "There are many exciting use cases for generative speech models, but because of the potential risks of misuse, we are not making the Voicebox model or code publicly available at this time. While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it's also necessary to strike the right balance between openness with responsibility. With these considerations, today we are sharing audio samples and a research paper detailing the approach and results we have achieved." In other words, Meta is publishing the results of its research for the advancement of science while taking care that this knowledge does not facilitate misuse.

What are the risks of using generative speech AI?

The idea of a technology capable of recreating voices opens up a world of possibilities; however, we cannot help but think about the risks involved. These range from deepfakes (AI-generated audio that recreates a person's voice, which could be used against politicians and celebrities or, more commonly, to impersonate someone) to the illegal editing of recordings. All these uses could be a headache for the judicial system, as they open up the possibility of editing or falsifying evidence, defaming public figures, infringing artists' copyrights, and so on. As long as there is no progress in security and legislation regarding artificial intelligence, it is dangerous to make this technology available to everyone. Another risk of artificial intelligence is the extreme dependence we could develop because of how easy it is to use, as we saw in Spike Jonze's movie "Her", for example.