Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, Christian Frank
Google Research
Abstract
We introduce MusicLM, a model generating high-fidelity music from text descriptions such as “a calming violin melody backed by a distorted guitar riff”. MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
About 17 years ago, I was at work and I can’t remember exactly what happened, but I stood up, flexed both arms and said, “I’m Nicky 2 Guns,” and everybody lost it. I’m not big, lol.
@chaban - Wow - brilliant. That was a fascinating read. I’ll stick to JPGs.
Reminds me of when I spotted something similar with (old, 2012-era) Skype databases. Turns out if you used Skype to chat, clicking “delete chat” didn’t clean out the database file. It just reset the pointer to let the next chat overwrite the “deleted” one. So if you had a shorter conversation the next time, the text of the longest chats was never overwritten in the file. That bug cost someone his marriage when I found all the chats he had while having an affair…
“What I would like to see,” he says, “is a central library of every recorded work. I’d like to sit down, pick pieces and compare versions, and I’d like to do this from home. One way would be via cable TV. The cable companies want to do this with movies, but that won’t be commercially viable for years. They’re aligning before the medium is ready. But with compression, the medium is ready for music right now. You could get 50 audio streams in one TV channel.”