MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

CVPR 2024 [Highlight, Top 2.8%]


University of Maryland · University of Toronto · Adobe Research

TL;DR

We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.

MeLFusion Framework

Our approach, MeLFusion, generates a music waveform $w$ conditioned on an image $I$ and a textual instruction $Y$. Visual semantics from $I$ are instilled into a text-to-music diffusion model (bottom green box) using a pre-trained, frozen text-to-image diffusion model (top blue box). The image $I$ is first DDIM-inverted into a noisy latent $z^I_T$. The self-attention features from the decoder layers of the text-to-image LDM that consumes $z^I_T$ are infused into the cross-attention features of the text-to-music LDM decoder layers, modulated by learned $\alpha$ parameters. This fusion operation, which happens in the decoder (green stripes), is detailed on the right side of the figure. The music encoder projects the spectrogram representation of the music into the latent space, and the music decoder reconstructs the spectrograms. Finally, a vocoder generates the waveform $w$ from the spectrograms. Please refer to Section 3 of the manuscript for more details.
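To make the fusion step concrete, below is a minimal sketch of one plausible form of the "visual synapse" for a single decoder layer. The tensor shapes, the number of attention heads, and the tanh-gated residual form are illustrative assumptions, not the authors' exact design; see Section 3 of the paper for the actual fusion.

```python
import torch
import torch.nn as nn

class VisualSynapse(nn.Module):
    """Sketch: infuse self-attention features of a frozen text-to-image LDM
    decoder layer into the cross-attention features of the matching
    text-to-music LDM decoder layer, modulated by a learned alpha."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learned per-layer gate
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, music_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # music_feats: (B, N_m, dim) cross-attention features of the music LDM.
        # image_feats: (B, N_i, dim) self-attention features of the image LDM,
        # obtained by running it on the DDIM-inverted latent z_T^I.
        visual_ctx, _ = self.attn(music_feats, image_feats, image_feats)
        # Alpha-modulated residual infusion of visual semantics.
        return music_feats + torch.tanh(self.alpha) * visual_ctx

# Toy usage: 256 music tokens attend over 1024 image tokens, width 64.
fuse = VisualSynapse(dim=64)
music = torch.randn(2, 256, 64)
image = torch.randn(2, 1024, 64)
assert fuse(music, image).shape == music.shape
```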

Abstract

Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, from movies to social media posts. Machine learning models that synthesize music are predominantly conditioned on textual descriptions of the desired music. Inspired by how musicians compose music not just from a movie script, but also from visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel visual synapse, which effectively infuses the semantics of the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of the generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope our work draws attention to this pragmatic yet relatively under-explored research area.
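Since FAD is a lower-is-better metric, the reported relative gain is the fractional reduction of the baseline's score. The snippet below illustrates the formula with made-up numbers chosen only to reproduce a figure close to the reported 67.98%; the actual scores are in the paper's tables.

```python
# Relative FAD gain = fractional reduction over the baseline (lower FAD is better).
fad_baseline = 5.81   # hypothetical text-only baseline score
fad_melfusion = 1.86  # hypothetical MeLFusion score
relative_gain = (fad_baseline - fad_melfusion) / fad_baseline
print(f"relative FAD gain: {relative_gain:.2%}")  # ~67.99%
```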

Dataset Examples


Qualitative Results of MeLFusion

Each example pairs a visual prompt (an image) with a textual prompt; the images and generated audio clips are available on the interactive project page. The textual prompts are:

1. An ambient composition with flute and guitar.
2. A solo violin piece that goes well with this oil painting.
3. Hip-Hop music for a setting with people in a jolly mood relaxing at poolside.
4. Jazz piano with mild drumming.
5. An upbeat soundtrack from heavy metal genre containing fast drumbeats and cymbal crashes.

MeLFusion Ablations

Example 1

Each row ablates the model's inputs; the visual prompts and generated audio clips are available on the project page. The textual prompt per row:

1. Visual: None · Textual: None
2. This is an orchestral music piece. The track can be played at the climax of a movie.
3. This is an orchestral music piece. The track can be played at the climax of a movie.
4. Sudden up tempo dramatic music amplifying the emotion of the scene followed by downtempo orchestra.
5. This is an orchestral music piece. The track can be played at the climax of a movie.

Example 2

As above, visual prompts and generated audio clips are on the project page. The textual prompt per row:

1. Visual: None · Textual: None
2. String orchestra playing a tune from horror genre.
3. String orchestra playing a tune from horror genre.
4. Haunting soundtrack.
5. String orchestra playing a tune from horror genre.

Comparison of MeLFusion with other approaches

Each baseline (MusicLM, Moûsai, MusicGen) receives no visual prompt; all three are given the same detailed textual prompt describing the scene in words:

"A soft musical track played on violin for a tranquil scenery is captured with the view of the night sky just before sunrise. The sky consists of dynamic spiraling clouds which symbolizes movement and aliveness. The stars are bright and prominent with strokes of yellows and whites represent a vivid yet peaceful moment. The village in the scene has houses whose windows emit warm and glowing light, giving a contrast to the cool, celestial tones of sky depicting pleasant emotions."

MeLFusion (Ours) instead receives the image itself, along with the short textual prompt "A soft musical track of folk acoustic genre played on violin." Generated audio for every method is available on the project page.

BibTeX

@inproceedings{chowdhury2024melfusion,
  author    = {Chowdhury, Sanjoy and Nag, Sayan and K J, Joseph and Vasan Srinivasan, Balaji and Manocha, Dinesh},
  title     = {MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models},
  booktitle = {CVPR},
  year      = {2024}
}