MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

Sanjoy Chowdhury1, Mohamed Elmoghany2, Yohan Abeysinghe3, Junjie Fei2, Sayan Nag4, Salman Khan3, Mohamed Elhoseiny2, Dinesh Manocha1
1University of Maryland, 2KAUST, 3MBZUAI, 4Adobe

TL;DR

We propose MAGNET, a multi-agent framework that retrieves and reasons over audio-visual cues from multiple videos to generate grounded, step-wise answers to complex queries.

MAGNET Overview

Our approach MAGNET generates step-wise, grounded answers to complex audio-visual queries by first retrieving the top-K relevant videos using AV-RAG, then processing them through dynamically instantiated audio-visual agents and a meta-agent aggregator. A salient frame selector module adaptively filters modality-agnostic keyframes, enhancing reasoning and generation quality. Please refer to Section 3 in the manuscript for more details.
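As a concrete illustration of this control flow, the minimal Python sketch below retrieves candidate videos, filters keyframes per video, and aggregates per-video evidence into one answer. The component signatures (retriever, frame selector, per-video agent, meta-agent aggregator) are illustrative assumptions rather than the actual implementation; see Section 3 of the manuscript for the real design.

# Illustrative sketch of the retrieve -> per-video analysis -> aggregate flow
# described above. All component signatures here are assumptions for exposition.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    video_id: str
    start_sec: float   # grounded segment start
    end_sec: float     # grounded segment end
    summary: str       # per-video audio-visual finding


def magnet_answer(
    query: str,
    retrieve: Callable[[str, int], List[str]],        # AV-RAG-style retriever: (query, K) -> video ids
    select_frames: Callable[[str, str], list],        # salient frame selector: (video id, query) -> keyframes
    analyze: Callable[[str, list, str], Evidence],    # per-video audio-visual agent
    aggregate: Callable[[str, List[Evidence]], str],  # meta-agent aggregator
    top_k: int = 5,
) -> str:
    # 1) Retrieve the top-K videos most relevant to the query.
    videos = retrieve(query, top_k)

    # 2) One dynamically instantiated agent per video: filter salient
    #    keyframes, then extract grounded, query-specific evidence.
    findings = [analyze(v, select_frames(v, query), query) for v in videos]

    # 3) Fuse per-video evidence into a step-wise, grounded answer.
    return aggregate(query, findings)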

Abstract

Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AVHaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. Additionally, we propose MAGNET, a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores, respectively, on the QA task of our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics: STEM, which captures alignment errors between ground-truth and predicted step sequences, and MTGS, which facilitates balanced and interpretable evaluation of segment-level grounding performance.
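To make the step-level evaluation concrete, the hedged sketch below shows one way to count alignment errors between a predicted and a ground-truth step sequence. The greedy text-similarity matching and the matched/missing/spurious counts are assumptions for exposition only; STEM's actual sub-scores are defined in the paper.

# Hedged illustration of step-sequence alignment errors in the spirit of STEM.
# The matching rule (greedy text similarity with a threshold) is an assumption.

from difflib import SequenceMatcher


def step_alignment_errors(pred_steps, gt_steps, threshold=0.6):
    """Greedily match predicted steps to ground-truth steps by text similarity
    and report matched, missing, and spurious step counts (illustrative only)."""
    unmatched_gt = list(gt_steps)
    matched = 0
    for p in pred_steps:
        scores = [SequenceMatcher(None, p, g).ratio() for g in unmatched_gt]
        if scores and max(scores) >= threshold:
            matched += 1
            unmatched_gt.pop(scores.index(max(scores)))
    return {
        "matched": matched,                     # predicted steps aligned to ground truth
        "missing": len(unmatched_gt),           # ground-truth steps with no match
        "spurious": len(pred_steps) - matched,  # predicted steps with no match
    }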

Dataset

The benchmark comprises 103 hours of video content from 500 video samples across 27 diverse categories, accompanied by carefully annotated QA pairs that temporally ground salient segments within the videos. To the best of our knowledge, this is the first benchmark of its kind, as no prior work provides multi-video linked audio-visual QA pairs.

We formulate this challenge as a new task, AVHaystacksQA, and introduce AVHaystacks, a new benchmark consisting of 3100 audio-visual QA pairs drawn from videos across diverse domains (Fig. 1). This benchmark pushes the boundaries of video retrieval and reasoning by requiring models to navigate and reason over large-scale video collections. To the best of our knowledge, no existing benchmark systematically evaluates multi-video keypoint detection and reasoning capabilities.

Qualitative Results of MAGNET

Example 1

Example 2

MAGNET Ablations

Performance is generally best when both audio and visual modalities are used, highlighting the benefit of multi-modal information. Gemini-1.5-Pro consistently outperforms Qwen-2.5-Omni across all retrieval and response-alignment metrics, indicating the benefits of MAGNET in formulating coherent and information-rich responses. The results also underscore the advantage of semantically guided sampling over uniform strategies: SFS more effectively captures informative segments, leading to better grounding, coherence, and human preference.
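For intuition on this ablation, the sketch below contrasts uniform frame sampling with a semantically guided selection in the spirit of SFS. Scoring frames by cosine similarity between frame and query embeddings is an assumption for exposition, not necessarily the paper's exact criterion.

# Illustrative contrast between uniform sampling and semantically guided
# frame selection. The cosine-similarity scoring is an assumption.

import numpy as np


def uniform_sample(num_frames: int, budget: int) -> list:
    """Pick `budget` frame indices at a fixed stride."""
    return np.linspace(0, num_frames - 1, budget, dtype=int).tolist()


def semantic_sample(frame_embeds: np.ndarray, query_embed: np.ndarray, budget: int) -> list:
    """Pick the `budget` frames whose embeddings align best with the query."""
    frames = frame_embeds / np.linalg.norm(frame_embeds, axis=1, keepdims=True)
    query = query_embed / np.linalg.norm(query_embed)
    scores = frames @ query                  # cosine similarity per frame
    top = np.argsort(-scores)[:budget]       # highest-scoring frames
    return sorted(top.tolist())              # keep temporal order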

The results show a steady rise in performance across all metrics as $\gamma$ increases, although a slight dip is observed at $\gamma = 25$, notably in BLEU@4 and Human Eval for our framework + Qwen-2.5-Omni-FT, potentially indicating the onset of overfitting or increased parameter sensitivity in that region. The varying magnitudes of the dip across metrics indicate that the effect of $\gamma$ is not uniform across different aspects of model performance.

Main Results

Grounding evaluation and step-wise error results on the AVHaystack-50 and AVHaystack-Full splits, measured with the MTGS and STEM (S_M, S_H, S_O, S_FP, S_FN) metrics, respectively, along with retrieval evaluation scores on both splits.
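As a rough illustration of what segment-level grounding evaluation measures, the sketch below scores each ground-truth segment by its best temporal IoU with a predicted segment. This is an assumption for exposition only; MTGS as defined in the paper may match and aggregate overlaps differently.

# Hedged sketch of segment-level grounding scoring via temporal IoU.
# Not the paper's MTGS formula; illustrative only.

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two (start_sec, end_sec) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_grounding_score(preds: list, gts: list) -> float:
    """Average best-match IoU over ground-truth segments (illustrative only)."""
    if not gts:
        return 0.0
    return sum(max((temporal_iou(p, g) for p in preds), default=0.0) for g in gts) / len(gts)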

Our proposed MAGNET offers significant gains over baseline approaches (first section) and our adapted baselines (second section) across multiple objective and subjective metrics on the two dataset splits. B@4: BLEU@4, Cr: CIDEr, H Eval: Human Evaluation. Closed-source models are included as a reference upper bound.

BibTeX

@misc{chowdhury2025magnetmultiagentframeworkfinding,
      title={MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks}, 
      author={Sanjoy Chowdhury and Mohamed Elmoghany and Yohan Abeysinghe and Junjie Fei and Sayan Nag and Salman Khan and Mohamed Elhoseiny and Dinesh Manocha},
      year={2025},
      eprint={2506.07016},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.07016}, 
}