Imagine typing "dramatic intro music" and hearing a soaring symphony or writing "creepy footsteps" and getting high-quality sound effects. That's the promise of Stable Audio, a text-to-audio AI model announced Wednesday by Stability AI that can synthesize stereo 44.1 kHz music or sounds from written descriptions. Before long, similar technology may challenge musicians for their jobs.
If you'll recall, Stability AI is the company that helped fund the creation of Stable Diffusion, a latent diffusion image synthesis model released in August 2022. Not content to limit itself to generating images, the company branched out into audio by backing Harmonai, an AI lab that launched music generator Dance Diffusion in September.
Now Stability and Harmonai want to break into commercial AI audio production with Stable Audio. Judging by the production samples, it represents a significant step up in audio quality from previous AI audio generators we've covered.
On its promotional page, Stability provides examples of the AI model in action with prompts like "epic trailer music intense tribal percussion and brass" and "lofi hip hop beat melodic chillhop 85 bpm." It also offers samples of sound effects generated using Stable Audio, such as an airline pilot speaking over an intercom and people talking in a busy restaurant.
To train its model, Stability partnered with stock music provider AudioSparx and licensed a data set "consisting of over 800,000 audio files containing music, sound effects, and single-instrument stems, as well as corresponding text metadata." After being fed 19,500 hours of that audio, Stable Audio can imitate certain sounds on command because, during training, each sound was associated with its text description inside the model's neural network.
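The idea that paired text and audio let a prompt steer generation can be illustrated with a toy sketch. This is purely illustrative and assumes nothing about Stability AI's actual code: the real system is a latent diffusion model with a learned text encoder, whereas here a hashed "embedding" stands in for the text encoder and a simple noise-to-target loop stands in for the denoising process.

```python
import hashlib
import numpy as np

def embed_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a learned text encoder: maps a prompt
    to a deterministic vector so the same prompt always conditions
    generation the same way."""
    seed = int(hashlib.sha256(prompt.encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def generate_latent(prompt: str, steps: int = 50, dim: int = 8) -> np.ndarray:
    """Toy 'generation': start from pure noise and nudge it toward the
    prompt's embedding at each step, mimicking how diffusion denoising
    is steered by text conditioning. In the real model this latent would
    then be decoded into stereo audio."""
    cond = embed_text(prompt, dim)
    rng = np.random.default_rng(0)
    x = rng.standard_normal(dim)       # begin with random noise
    for _ in range(steps):
        x = x + 0.1 * (cond - x)       # each step pulls the noise toward
                                       # the text-conditioned target
    return x

lofi = generate_latent("lofi hip hop beat melodic chillhop 85 bpm")
epic = generate_latent("epic trailer music intense tribal percussion and brass")
```

Because the conditioning vector is deterministic per prompt, the same description steers generation toward the same region, while different descriptions diverge, which is the core association the training pairs establish.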