MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

CVPR 2024 [Highlight, Top 2.8%]


University of Maryland · University of Toronto · Adobe Research

TL;DR

We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.

MeLFusion Framework

Our approach, MeLFusion, generates a music waveform $w$ conditioned on an image $I$ and a textual instruction $Y$. Visual semantics from $I$ are instilled into a text-to-music diffusion model (bottom green box) using a pre-trained, frozen text-to-image diffusion model (top blue box). The image $I$ is first DDIM-inverted into a noisy latent $z^I_T$. The self-attention features from the decoder layers of the text-to-image LDM that consumes $z^I_T$ are infused into the cross-attention features of the text-to-music LDM decoder layers, modulated by learned $\alpha$ parameters. This fusion operation, which happens in the decoder (green stripes), is detailed on the right side of the figure. The music encoder projects the spectrogram representation of the music into the latent space, and the music decoder reconstructs the spectrograms. Finally, a vocoder generates the waveform $w$ from the spectrograms. Please refer to Section 3 of the manuscript for more details.
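To make the fusion step concrete, below is a minimal sketch of one plausible form of the "visual synapse" for a single decoder layer. The tensor shapes, the number of attention heads, and the tanh-gated residual form are illustrative assumptions, not the authors' exact design; see Section 3 of the paper for the actual fusion.

```python
import torch
import torch.nn as nn

class VisualSynapse(nn.Module):
    """Sketch: infuse self-attention features of a frozen text-to-image LDM
    decoder layer into the cross-attention features of the matching
    text-to-music LDM decoder layer, modulated by a learned alpha."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learned per-layer gate
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, music_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # music_feats: (B, N_m, dim) cross-attention features of the music LDM.
        # image_feats: (B, N_i, dim) self-attention features of the image LDM,
        # obtained by running it on the DDIM-inverted latent z_T^I.
        visual_ctx, _ = self.attn(music_feats, image_feats, image_feats)
        # Alpha-modulated residual infusion of visual semantics.
        return music_feats + torch.tanh(self.alpha) * visual_ctx

# Toy usage: 256 music tokens attend over 1024 image tokens, width 64.
fuse = VisualSynapse(dim=64)
music = torch.randn(2, 256, 64)
image = torch.randn(2, 1024, 64)
assert fuse(music, image).shape == music.shape
```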

Abstract

Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, from movies to social media posts. Machine learning models that synthesize music are predominantly conditioned on textual descriptions of the desired music. Inspired by how musicians compose music not just from a movie script, but also from visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel visual synapse, which effectively infuses the semantics of the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM. Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of the generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope our work draws attention to this pragmatic yet relatively under-explored research area.
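Since FAD is a lower-is-better metric, the reported relative gain is the fractional reduction of the baseline's score. The snippet below illustrates the formula with made-up numbers chosen only to reproduce a figure close to the reported 67.98%; the actual scores are in the paper's tables.

```python
# Relative FAD gain = fractional reduction over the baseline (lower FAD is better).
fad_baseline = 5.81   # hypothetical text-only baseline score
fad_melfusion = 1.86  # hypothetical MeLFusion score
relative_gain = (fad_baseline - fad_melfusion) / fad_baseline
print(f"relative FAD gain: {relative_gain:.2%}")  # ~67.99%
```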

Dataset Examples


Qualitative Results of MeLFusion

Each example pairs a visual prompt (an image) with a textual prompt; the images and generated audio clips are available on the interactive project page. The textual prompts are:

1. An ambient composition with flute and guitar.
2. A solo violin piece that goes well with this oil painting.
3. Hip-Hop music for a setting with people in a jolly mood relaxing at poolside.
4. Jazz piano with mild drumming.
5. An upbeat soundtrack from heavy metal genre containing fast drumbeats and cymbal crashes.

MeLFusion Ablations

Example 1

Each row ablates the model's inputs; the visual prompts and generated audio clips are available on the project page. The textual prompt per row:

1. Visual: None · Textual: None
2. This is an orchestral music piece. The track can be played at the climax of a movie.
3. This is an orchestral music piece. The track can be played at the climax of a movie.
4. Sudden up tempo dramatic music amplifying the emotion of the scene followed by downtempo orchestra.
5. This is an orchestral music piece. The track can be played at the climax of a movie.

Example 2

As above, visual prompts and generated audio clips are on the project page. The textual prompt per row:

1. Visual: None · Textual: None
2. String orchestra playing a tune from horror genre.
3. String orchestra playing a tune from horror genre.
4. Haunting soundtrack.
5. String orchestra playing a tune from horror genre.

Comparison of MeLFusion with other approaches

Each baseline (MusicLM, Moûsai, MusicGen) receives no visual prompt; all three are given the same detailed textual prompt describing the scene in words:

"A soft musical track played on violin for a tranquil scenery is captured with the view of the night sky just before sunrise. The sky consists of dynamic spiraling clouds which symbolizes movement and aliveness. The stars are bright and prominent with strokes of yellows and whites represent a vivid yet peaceful moment. The village in the scene has houses whose windows emit warm and glowing light, giving a contrast to the cool, celestial tones of sky depicting pleasant emotions."

MeLFusion (Ours) instead receives the image itself, along with the short textual prompt "A soft musical track of folk acoustic genre played on violin." Generated audio for every method is available on the project page.

BibTeX

@inproceedings{chowdhury2024melfusion,
  author    = {Chowdhury, Sanjoy and Nag, Sayan and K J, Joseph and Vasan Srinivasan, Balaji and Manocha, Dinesh},
  title     = {MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models},
  booktitle = {CVPR},
  year      = {2024}
}