Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

ECCV 2024


1University of Maryland, 2University of Toronto, 3Mila and Université de Montréal, 4KAUST

TL;DR

We propose Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally.

We present Meerkat, an audio-visual LLM that can effectively ground both spatially and temporally in image and audio. Our model is adept at tasks that require fine-grained understanding, such as Audio-Referred Image Grounding, Image-Guided (IG) Audio Temporal Localization & Audio-Visual (AV) Fact-Checking. It can also be extended to coarse-grained tasks like AVQA & AV Captioning.

Abstract

Leveraging Large Language Models’ remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, progress in these directions has mostly focused on tasks that only require a coarse-grained understanding of audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset, AVFIT, that comprises 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.

AVFIT-3M Dataset Distribution


Task-wise dataset distribution. Bi-coloured cells denote collections of paired image-audio samples drawn from public datasets following our data curation strategy, while single-coloured cells signify direct adaptation. Datasets with dashed outlines are used only during model training, while the ones with are reserved for zero-shot evaluations. Other datasets have a defined train/test split. Numbers in the bottom right represent the total #samples in each task.

Meerkat Architecture


Overview of Meerkat. Our model is equipped with fine-grained audio-visual comprehension abilities. When fed with an (image I, audio A) pair, the Audio-Visual Optimal Transport alignment (AVOpT) module B learns patch-wise image-audio associations to facilitate weak alignment between the two modalities by minimizing the patch-level Wasserstein distance. Subsequently, the Audio-Visual Attention Consistency Enforcement (AVACE) module A maximizes region-level alignment by confining the cross-modal attention maps around the objects of interest and minimizing their association with the background. After tokenizing the text instruction T, the modality-specific latents (z̃_I, z̃_A, z_T) are passed to the instruction-tuned Llama 2 model, which serves as a unified interface for the downstream tasks. We employ LoRA-based fine-tuning of the LLM.
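The caption above describes the two alignment objectives only at a high level. The PyTorch sketch below is a minimal, hypothetical illustration of the general recipe they name: an entropic optimal-transport (Sinkhorn) loss over image/audio patch embeddings for the AVOpT-style weak alignment, and a mask-based penalty that concentrates cross-modal attention on the region of interest for the AVACE-style consistency term. The function names, tensor shapes, cosine cost, and hyperparameters (eps, n_iters) are our assumptions for illustration, not Meerkat's actual implementation.

```python
import torch
import torch.nn.functional as F


def avopt_loss(img_patches, aud_patches, eps=0.05, n_iters=50):
    """Entropic-OT (Sinkhorn) approximation of the patch-level Wasserstein
    distance between image and audio patch embeddings (AVOpT-style sketch).

    img_patches: (B, Ni, D) image patch latents
    aud_patches: (B, Na, D) audio patch latents
    """
    img = F.normalize(img_patches, dim=-1)
    aud = F.normalize(aud_patches, dim=-1)
    cost = 1.0 - torch.bmm(img, aud.transpose(1, 2))           # (B, Ni, Na) cosine cost

    B, Ni, Na = cost.shape
    mu = cost.new_full((B, Ni), 1.0 / Ni)                       # uniform marginals
    nu = cost.new_full((B, Na), 1.0 / Na)
    f, g = torch.zeros_like(mu), torch.zeros_like(nu)           # dual potentials
    logK = -cost / eps

    for _ in range(n_iters):                                    # log-domain Sinkhorn updates
        f = eps * (torch.log(mu) - torch.logsumexp(logK + (g / eps).unsqueeze(1), dim=2))
        g = eps * (torch.log(nu) - torch.logsumexp(logK + (f / eps).unsqueeze(2), dim=1))

    T = torch.exp(logK + (f / eps).unsqueeze(2) + (g / eps).unsqueeze(1))  # transport plan
    return (T * cost).sum(dim=(1, 2)).mean()                    # <T, C>: OT alignment cost


def avace_loss(attn_map, region_mask, eps=1e-8):
    """Confines a cross-modal attention map to the region of interest
    (AVACE-style sketch).

    attn_map:    (B, Ni) audio-to-image attention over image patches
    region_mask: (B, Ni) binary mask, 1 inside the ground-truth region
    """
    attn = attn_map / attn_map.sum(dim=-1, keepdim=True).clamp_min(eps)
    inside = (attn * region_mask).sum(dim=-1)                   # attention mass on the object
    return -torch.log(inside.clamp_min(eps)).mean()             # push mass off the background


# Toy usage with illustrative shapes:
img_z = torch.randn(2, 196, 512)                    # image patch latents
aud_z = torch.randn(2, 64, 512)                     # audio patch latents
attn = torch.rand(2, 196)                           # cross-modal attention scores
mask = (torch.rand(2, 196) > 0.7).float()           # ground-truth region mask
loss = avopt_loss(img_z, aud_z) + avace_loss(attn, mask)
```

In a full training setup, losses of this kind would be added to the instruction-tuning objective with task-specific weights; the exact formulations in the paper may differ.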

Qualitative Results of Meerkat

AVFIT Statistics


BibTeX

@inproceedings{chowdhury2024meerkat,
      title={Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time},
      author={Chowdhury, Sanjoy and Nag, Sayan and Dasgupta, Subhrajyoti and Chen, Jun and Elhoseiny, Mohamed and Gao, Ruohan and Manocha, Dinesh},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2024}
}