PyTorch Library for Deep Learning Research on Audio Generation


To let researchers and practitioners train their own models and advance the state of the art, Meta has released the source code for AudioCraft, its generative AI framework for producing audio and music from text. AudioCraft comprises three models: MusicGen, AudioGen, and EnCodec.

  • MusicGen generates music from text prompts; it was trained on Meta-owned and specifically licensed music.
  • AudioGen generates audio from text prompts; it was trained on public sound effects.
  • EnCodec is a neural audio codec that combines an encoder, a quantizer, and a decoder.

Meta is releasing a new and improved version of the EnCodec decoder, which allows higher-quality music generation with fewer artifacts. It is also releasing the pre-trained AudioGen model, which can generate environmental sounds and sound effects such as a dog barking, cars honking, or footsteps on a wooden floor, along with all of the weights and code for the AudioCraft models. Researchers who want to understand the technology better can work with the models directly. Meta says it is making the platform available to researchers and practitioners for the first time, so that they can train their own models on their own datasets and help advance the state of the art.

Once trained, the models can produce realistic, high-quality music or sound effects from the text a user enters. MusicGen generates music and AudioGen generates sound effects, each reflecting its training set: MusicGen was trained on Meta-owned and licensed music, and AudioGen on public sound datasets. Meta had already released both models separately, MusicGen in June 2023 and AudioGen in October 2022.

Meta claims that AudioCraft can produce professional-grade sound through an intuitive interface, and that it simplifies the design of state-of-the-art audio generation with a new approach. As Meta describes it, AudioCraft first uses the EnCodec neural audio codec to compress the raw audio signal into discrete audio tokens, which yields a fixed “vocabulary” for music samples. An autoregressive language model is then trained over these tokens; it exploits the tokens’ underlying structure to capture their long-term relationships, which is important for creating music. Given a textual description, the trained model generates new tokens, which are sent back to the EnCodec decoder to synthesize audio and music.
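The tokenize, model, and decode steps can be illustrated with a deliberately tiny sketch. This is not AudioCraft code and does not use its API: the uniform quantizer below merely stands in for EnCodec, a bigram count table stands in for the autoregressive language model, and text conditioning is left out entirely. All names (`encode`, `decode`, `fit_bigram`, `generate`, `CODEBOOK_SIZE`) are hypothetical and chosen only for this illustration.

```python
import math
import random
from collections import defaultdict

CODEBOOK_SIZE = 16  # size of the toy "vocabulary" of audio tokens (assumption)


def encode(samples):
    """Toy stand-in for a codec encoder + quantizer:
    map each sample in [-1, 1] to one of CODEBOOK_SIZE integer tokens."""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip to the valid range
        tokens.append(min(CODEBOOK_SIZE - 1, int((s + 1.0) / 2.0 * CODEBOOK_SIZE)))
    return tokens


def decode(tokens):
    """Toy stand-in for a codec decoder: map each token to its bin centre."""
    return [(t + 0.5) / CODEBOOK_SIZE * 2.0 - 1.0 for t in tokens]


def fit_bigram(tokens):
    """Count next-token frequencies: about the simplest autoregressive model."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, length, rng):
    """Sample tokens one at a time, each conditioned on the previous token."""
    out = [start]
    for _ in range(length - 1):
        options = counts.get(out[-1])
        if not options:  # context never seen in training: repeat the token
            out.append(out[-1])
            continue
        choices, weights = zip(*options.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return out


# "Training audio": two seconds of a 5 Hz sine wave sampled at 200 Hz.
training_audio = [math.sin(2 * math.pi * 5 * n / 200) for n in range(400)]
training_tokens = encode(training_audio)   # raw signal -> discrete tokens
model = fit_bigram(training_tokens)        # learn token-to-token structure

rng = random.Random(0)
new_tokens = generate(model, training_tokens[0], 100, rng)  # sample new tokens
new_audio = decode(new_tokens)             # tokens -> waveform samples
```

A real system replaces each toy piece with a learned neural network and conditions generation on the text prompt, but the data flow is the same: audio becomes tokens, a model predicts tokens one after another, and a decoder turns the predicted tokens back into sound.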

Meta also explains how AudioCraft differs from conventional AI music generators. Symbolic representations of music, such as MIDI or piano rolls, have long been used to train generative models. However, these representations fall short when it comes to capturing the subtleties and aesthetic components of musical expression. A more complex approach works on the original audio itself, combining self-supervised audio representation learning with several hierarchical, or cascaded, models to capture the signal’s longer-range structure. These systems produce good sound, but the quality still leaves room for improvement.

In accordance with the principles of Responsible AI, Meta’s researchers are making AudioGen and MusicGen available to the research community in various sizes, together with model cards that document how the models were developed. The audio research framework and training code are released under the MIT license so that others can use and build on them. Meta thinks such models could become useful to both amateur and professional musicians once more sophisticated controls are developed. With a robust open-source foundation, one can imagine, for example, bedtime story readings enhanced with sound effects and dramatic music.