
Meta unveils Audiobox, an AI that clones voices and generates ambient sounds

Image: A laptop displaying a close-up of a lipsticked mouth and teeth surrounded by a burst of ambient colors. Credit: VentureBeat made with Midjourney

Voice cloning is one of the areas emerging rapidly thanks to generative AI. The term refers to replicating a person’s vocal style — pitch, timbre, rhythm, mannerisms, and unique pronunciations — through technology.

While startups including ElevenLabs have received tens of millions in funding to dedicate themselves to this pursuit, Meta Platforms, the parent company of Facebook, Instagram, WhatsApp, and Oculus VR, has released its own free voice cloning program, Audiobox — with a catch.

Unveiled today on Meta’s website by researchers working at the Facebook AI Research (FAIR) lab, Audiobox is described as a “new foundation research model for audio generation” built atop its earlier work in this area, Voicebox.

“It can generate voices and sound effects using a combination of voice inputs and natural language text prompts — making it easy to create custom audio for a wide range of use cases,” according to the Audiobox webpage.


Simply type in a sentence that you want a cloned voice to say, or a description of a sound you want to generate, and Audiobox will do the rest. Users can also record their voice and have it cloned by Audiobox.

A ‘family’ of audio-generating AIs

Meta further noted that it created a “family of models,” one for speech mimicry and another for generating ambient sounds and sound effects, such as dogs barking, sirens, or children playing, and that both are “built upon the shared self-supervised model Audiobox SSL.”

Self-supervised learning (SSL) is a machine learning technique in which an algorithm generates its own training signal from unlabeled data, as opposed to supervised learning, where the data is labeled in advance.

The researchers published a scientific paper explaining some of their methodology and the rationale for taking an SSL approach, writing: “Because labeled data are not always available or of high quality, and data scaling is the key to generalization, our strategy is to train this foundation model using audio without any supervision, such as transcripts, captions, or attribute labels, which can be found in larger quantities.”
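To make the idea concrete, here is a minimal sketch of what that kind of label-free pretraining can look like: a masked-prediction objective over audio frames in PyTorch, where the model must reconstruct stretches of audio it was not allowed to see. Everything here (the TinyAudioEncoder, the 30% masking ratio, the mean-squared-error loss) is an illustrative assumption, not Meta’s actual Audiobox SSL recipe:

```python
# Minimal sketch of self-supervised pretraining on unlabeled audio.
# Illustrative only; this is not Meta's actual Audiobox SSL objective.
# Assumed setup: audio is already converted to frame-level features
# (e.g., mel spectrogram frames); the model learns to reconstruct
# frames that were masked out, so no transcripts or labels are needed.
import torch
import torch.nn as nn

class TinyAudioEncoder(nn.Module):
    """A stand-in encoder: a couple of Transformer layers over audio frames."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.proj_in = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(dim, n_mels)  # predict the original frames

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj_out(self.encoder(self.proj_in(x)))

def masked_pretrain_step(model, frames, mask_ratio=0.3):
    """One training step: hide a random subset of frames, predict them back."""
    batch, time, _ = frames.shape
    mask = torch.rand(batch, time, device=frames.device) < mask_ratio
    corrupted = frames.clone()
    corrupted[mask] = 0.0                      # zero out the masked frames
    pred = model(corrupted)
    # The loss is computed only on the frames the model could not see.
    return nn.functional.mse_loss(pred[mask], frames[mask])

if __name__ == "__main__":
    model = TinyAudioEncoder()
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    frames = torch.randn(8, 200, 80)           # fake batch: 8 clips, 200 frames
    loss = masked_pretrain_step(model, frames)
    loss.backward()
    opt.step()
    print(f"pretraining loss: {loss.item():.4f}")
```

The point of the sketch is that the supervision signal comes entirely from the audio itself; no transcripts, captions, or attribute labels are needed, which is exactly what lets the training data scale in the way the paper describes.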

Of course, most leading generative AI models depend heavily on human-generated data to learn how to create new content, and Audiobox is no exception. The FAIR researchers relied upon “160K hours of speech (primarily English), 20K hours of music and 6K hours of sound samples.”

“The speech portion covers audiobooks, podcasts, read sentences, talks, conversations, and in-the-wild recordings including various acoustic conditions and non-verbal voices. To ensure fairness and a good representation for people from various groups, it includes speakers from over 150 countries speaking over 200 different primary languages.”

The research paper does not specify exactly where this data was sourced or whether it was in the public domain, but that is surely an important question at a time when artists, authors, and music publishers are suing a host of AI companies for training on potentially copyrighted material without the creators’ or rights owners’ express consent.

Asked about it by VentureBeat, a Meta spokesperson responded via email: “Audiobox was trained on publicly available and licensed datasets,” but did not specify which datasets were used or where they were derived from.

You can try it yourself and clone your own voice now

To showcase the capabilities of Audiobox, Meta has also released a host of interactive demos, including one that lets users record themselves speaking about a sentence’s worth of text and then replicates their voice.

The user can then type in any text they want the cloned voice to say and hear it read back in that voice.

You can try it for yourself here. In my case, the resulting AI-generated clone was eerily similar to, though not exactly the same as, my own voice (as attested by my wife and child, who heard it without knowing what it was).

Meta also allows users to generate entirely new voices from text descriptions of what they should sound like (“deep feminine voice,” “high-pitched masculine speaker from the U.S.,” etc.), as well as restyle voices recorded by the user or type in a text prompt to generate whole new sounds. I tried the latter with “dogs barking” and received two versions that, to my ears, were indistinguishable from the real thing.

Now for the big catch: Meta includes a disclaimer with its Audiobox interactive demos noting that “this is a research demo and may not be used for any commercial purpose(s),” and that the demos are off-limits to residents of “the States of Illinois or Texas,” which have state laws restricting the kind of audio collection Meta is doing for the demos.

Interestingly, like its new Imagine by Meta AI image generation web app unveiled last week, Audiobox is not open source, bucking the commitment to openness Meta evidenced earlier with the release of its Llama 2 family of large language models (LLMs).

Asked about this, a Meta spokesperson told VentureBeat via email:

“As part of our ongoing commitment to responsible research conduct, we’ll soon be inviting researchers and academic institutions to apply for a grant to conduct safety and responsibility research with Audiobox.

“We are releasing Audiobox to a hand-selected group of researchers and academic institutions with a track record in speech research to help further the state of the art in this research area, and ensure we have a diverse set of partners to tackle the Responsible AI aspects of this work.”

So, for now, the technology can’t be used for any moneymaking or business purposes, nor can it be used by residents of two of the most populous states in the U.S. But with AI advancing at a rapid clip, expect this to change, and expect commercial versions to follow soon, if not from Meta, then from others.