We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
MeLFusion generates a music waveform $w$ conditioned on an image $I$ and a textual instruction $Y$. Visual semantics from $I$ are instilled into a text-to-music diffusion model (bottom green box) using a pre-trained, frozen text-to-image diffusion model (top blue box). The image $I$ is first DDIM-inverted into a noisy latent $z^I_T$. The self-attention features from the decoder layers of the text-to-image LDM that consumes $z^I_T$ are infused into the cross-attention features of the text-to-music LDM decoder layers, modulated by learned $\alpha$ parameters. This fusion operation, which happens in the decoder (green stripes), is detailed on the right side of the figure. The music encoder projects the spectrogram representation of the music into the latent space, and the music decoder reconstructs the spectrograms from it. Finally, a vocoder generates the waveform $w$ from the spectrograms. Please refer to Section 3 of the manuscript for more details.
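To make the $\alpha$-modulated fusion step more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation described in the manuscript. It assumes the image-side self-attention features have already been brought to the same shape as the music-side cross-attention features, and that each decoder layer holds one scalar gate obtained by squashing a learned parameter through a sigmoid. The names `fuse_attention` and `alpha_logit`, and the convex-combination form itself, are hypothetical.

```python
import math


def sigmoid(x):
    """Map an unconstrained learned parameter to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_attention(music_feats, image_feats, alpha_logit):
    """Blend image self-attention features into music cross-attention features.

    Assumptions (not taken from the source):
      * both inputs are lists of token vectors with identical shapes;
      * the gate is a single scalar per layer, alpha = sigmoid(alpha_logit);
      * fusion is a convex combination of the two feature sets.
    """
    if len(music_feats) != len(image_feats):
        raise ValueError("token counts of the two feature sets must match")

    alpha = sigmoid(alpha_logit)
    fused = []
    for music_tok, image_tok in zip(music_feats, image_feats):
        if len(music_tok) != len(image_tok):
            raise ValueError("feature dimensions must match")
        # alpha -> 0 keeps the text-to-music features unchanged;
        # alpha -> 1 replaces them with the image-derived features.
        fused.append(
            [(1.0 - alpha) * m + alpha * i for m, i in zip(music_tok, image_tok)]
        )
    return fused


if __name__ == "__main__":
    music = [[1.0, 2.0]]   # one token, two channels (toy values)
    image = [[3.0, 4.0]]
    print(fuse_attention(music, image, alpha_logit=0.0))  # [[2.0, 3.0]]
```

With `alpha_logit = 0.0` the gate is exactly 0.5, so the toy output is the midpoint of the two inputs. A strongly negative `alpha_logit` leaves the music features almost untouched, so under this assumed form a layer could learn to ignore the image where it is not helpful.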
Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work draws attention to this pragmatic, yet relatively under-explored research area.
This is an orchestral music piece. The track can be played at the climax of a movie.
Sudden up tempo dramatic music amplifying the emotion of the scene followed by downtempo orchestra.
String orchestra playing a tune from horror genre.
Haunting soundtrack.
A soft musical track played on violin for a tranquil scenery is captured with the view of the night sky just before sunrise. The sky consists of dynamic spiraling clouds which symbolizes movement and aliveness. The stars are bright and prominent with strokes of yellows and whites represent a vivid yet peaceful moment. The village in the scene has houses whose windows emit warm and glowing light, giving a contrast to the cool, celestial tones of sky depicting pleasant emotions.
@article{chowdhury2023melfusion,
author = {Chowdhury, Sanjoy and Nag, Sayan and K J, Joseph and Vasan Srinivasan, Balaji and Manocha, Dinesh},
title = {MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models},
journal = {CVPR},
year = {2024}
}