Meta AI Introduces Voicebox, an AI Model That Can Generate Speech Completely on Its Own

Meta AI Labs has introduced a generative AI model designed to do for speech what other AI services do for text and images. The new “Voicebox” model can generalize to speech generation tasks for which it was not specifically trained, with “state-of-the-art performance”, according to the researchers.

Voicebox Can Synthesize Speech in Six Languages

Meta AI describes the new model in a blog post. The researchers explain that, like generative systems for images and text, Voicebox “creates outputs in a wide variety of styles”, and can either start from scratch or modify a given example. The difference, of course, is that Voicebox produces high-quality audio clips rather than images or text. The researchers claim that the model can synthesize speech in six languages, and can also remove noise, edit content, convert styles and “generate various samples”.
Meta trained Voicebox on over 50,000 hours of recorded speech and transcripts from public domain audiobooks. The data covers English, French, Spanish, German, Polish and Portuguese. The model is trained to predict a segment of speech when given the surrounding speech and the transcript of the segment. This ability to fill in speech from context is then applied to speech generation tasks.
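To make the training setup more concrete, here is a minimal sketch of how one such "fill in the gap" training example might be built. It is an illustration only, not Meta's code: the names `InfillExample` and `make_infill_example` are hypothetical, and each audio frame is simplified to a single number, whereas a real system would use feature vectors.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InfillExample:
    """One toy training example (hypothetical structure, for illustration)."""
    context: List[Optional[float]]  # audio frames with the hidden span set to None
    transcript: str                 # text the model is given alongside the audio
    target: List[float]             # the hidden frames the model should predict
    span: Tuple[int, int]           # (start, end) indices of the hidden span


def make_infill_example(frames: List[float], transcript: str,
                        span_len: int, rng: random.Random) -> InfillExample:
    """Hide a random contiguous span of frames and keep it as the target."""
    if not 0 < span_len <= len(frames):
        raise ValueError("span_len must be between 1 and len(frames)")
    start = rng.randrange(0, len(frames) - span_len + 1)
    end = start + span_len
    context: List[Optional[float]] = list(frames)
    for i in range(start, end):
        context[i] = None  # the model only sees the surrounding speech
    return InfillExample(context, transcript, list(frames[start:end]), (start, end))


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [0.1 * i for i in range(10)]  # stand-in for real audio features
    example = make_infill_example(frames, "hello world", 3, rng)
    print(example.span, example.context)
```

A model trained on many examples like this learns to produce the hidden frames from the visible context and the text, which is the skill the article says is later reused for generation tasks such as editing or noise removal.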

What Makes Voicebox Special?

The main technological breakthrough that makes this new AI model unique is its ability to synthesize speech without task-specific training. Before Voicebox, generative AI for speech required specific training for each task using carefully prepared training data. Voicebox can learn “simply from raw audio and an accompanying transcript,” the researchers said.
To make the AI output sound more “human”, Meta built Voicebox on a method called Flow Matching (FM). This helps Voicebox outperform Microsoft’s VALL-E in terms of intelligibility and audio similarity, they claim. Meta’s generative AI is currently available in WhatsApp, Messenger and Instagram.