Text-to-music is the task of generating musical compositions from text descriptions such as "90s rock song with a guitar riff." Music generation is difficult because it requires modeling long-range structure. Unlike speech, music uses the full frequency range, which demands sampling the signal more often; music recordings typically use sample rates of 44.1 kHz or 48 kHz rather than the 16 kHz common for speech. Music also layers the melodies and harmonies of several instruments into intricate structures, and human listeners are extremely sensitive to dissonance, so there is little room for melodic mistakes when generating music.
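The sequence-length cost of those higher sample rates can be made concrete with a quick sketch (the helper below is illustrative, not from the paper):

```python
# Rough arithmetic: raw waveform samples per clip at common audio sample
# rates, showing why full-band music is costlier to model than speech.
def samples_per_clip(sample_rate_hz: int, seconds: float) -> int:
    """Number of raw waveform samples in a clip of the given length."""
    return int(sample_rate_hz * seconds)

speech = samples_per_clip(16_000, 10.0)   # 10 s of 16 kHz speech
music = samples_per_clip(44_100, 10.0)    # 10 s of 44.1 kHz music

# Music at 44.1 kHz carries roughly 2.75x more samples than 16 kHz speech,
# so a model must handle far longer sequences for the same duration.
print(speech, music, music / speech)
```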
Finally, music producers need to control the generation process with a variety of signals, including key, instrument, melody, and genre. Recent advances in audio synthesis, sequential modeling, and self-supervised audio representation learning make such models feasible. Recent research proposed representing an audio signal as several parallel streams of discrete tokens, which makes audio modeling more tractable and enables both efficient modeling and high-quality generation. It does, however, require jointly modeling several interdependent parallel streams.
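A small sketch of the idea of compressing audio into parallel token streams (the frame rate and codebook count below are illustrative assumptions, not figures quoted from the text):

```python
# A neural audio codec compresses the waveform into K parallel streams of
# discrete tokens at a much lower frame rate than the raw sample rate.
def token_grid_shape(duration_s: float, frame_rate_hz: int, n_codebooks: int):
    """Shape (K, T) of the discrete token grid for a clip."""
    n_frames = int(duration_s * frame_rate_hz)
    return (n_codebooks, n_frames)

# e.g. with an assumed 50 Hz frame rate and 4 codebooks, 10 s of audio
# becomes 4 parallel streams of 500 tokens each, instead of 320,000 raw
# samples at 32 kHz.
print(token_grid_shape(10.0, frame_rate_hz=50, n_codebooks=4))  # (4, 500)
```

The modeling challenge the text refers to is that these K streams describe the same signal, so they are strongly dependent and cannot simply be generated independently.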
Prior work has proposed modeling several concurrent streams of speech tokens with a delay pattern, i.e., by introducing offsets between the streams. Other work represents musical segments as multiple sequences of discrete tokens at different granularities and models them with a hierarchy of autoregressive models; a similar strategy has been applied to generating singing with accompaniment. Another line of work splits the problem into two stages: (i) modeling only the first stream of tokens, and (ii) using a post-network to model the remaining streams jointly and non-autoregressively. In this study, researchers from Meta AI introduce MUSICGEN, a simple and controllable music generation model that can produce high-quality music from a text description.
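The delay-pattern idea can be sketched in a few lines (a minimal illustration; the padding token and function names are our own, not the paper's):

```python
# Sketch of "delay" interleaving: stream k is shifted right by k steps so
# that a single autoregressive model can predict one token per stream at
# each position, while each stream still conditions on the streams above it
# from the same original frame. PAD marks positions with no real token.
PAD = -1

def apply_delay_pattern(streams):
    """streams: list of K equal-length token lists -> K delayed lists."""
    k_streams = len(streams)
    out = []
    for k, stream in enumerate(streams):
        # k pads in front, (K - 1 - k) pads behind, so all rows align.
        out.append([PAD] * k + stream + [PAD] * (k_streams - 1 - k))
    return out

streams = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
for row in apply_delay_pattern(streams):
    print(row)
# [10, 11, 12, -1, -1]
# [-1, 20, 21, 22, -1]
# [-1, -1, 30, 31, 32]
```

Decoding the delayed grid left to right yields one token per stream per step, which is what lets a single-stage model replace a hierarchy of models.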
Generalizing this earlier research, they provide a generic framework for modeling multiple parallel streams of acoustic tokens. To increase the controllability of the generated samples, they also incorporate unsupervised melody conditioning, which enables the model to produce music that follows a given harmonic and melodic structure. They evaluate MUSICGEN thoroughly and show that it clearly outperforms the evaluated baselines, achieving a subjective score of 84.8 out of 100 versus 80.5 for the best baseline. They also report ablation studies that isolate the contribution of each component to the performance of the full model.
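Melody conditioning of this kind is commonly built on chroma-like features, which fold frequencies onto the twelve pitch classes of the octave. The helper below is our own illustration of that representation, not code from the paper:

```python
import math

# Map a frequency in Hz to one of the 12 chroma (pitch-class) bins,
# using the convention A4 = 440 Hz. A chroma sequence like this is the
# kind of melody descriptor a model can be conditioned on.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(freq_hz: float) -> str:
    """Return the nearest pitch class for a frequency."""
    # MIDI note number relative to A4 (MIDI 69), folded modulo 12.
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return PITCH_CLASSES[midi % 12]

print(pitch_class(440.0))   # A
print(pitch_class(261.63))  # C (middle C)
```

Conditioning on pitch classes rather than raw audio is what makes the melody signal octave- and timbre-invariant, so the model can follow a tune without copying the reference recording.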
Finally, the human evaluation indicates that MUSICGEN produces high-quality samples that adhere to the text description and align melodically with a given harmonic structure. Their contributions: (i) they present a simple and efficient method for producing high-quality music at 32 kHz, demonstrating that MUSICGEN can generate consistent music with a single-stage language model and an effective codebook interleaving strategy; (ii) they provide a single model that performs both text-conditioned and melody-conditioned generation, showing that the generated audio is faithful to the text conditioning and consistent with the given melody; (iii) they offer in-depth objective and subjective evaluations of the fundamental design choices behind their method. The PyTorch implementation of MusicGen is available in the AudioCraft library on GitHub.