Aurelia

Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs

ICCV 2025


1University of Maryland, 2MBZUAI, 3University of Toronto, 4Adobe Research, 5KAUST

TL;DR

We introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge.

šŸ”„Highlights

Key contributions of AURELIA:
  1. We present AURELIA, a scalable and automated pipeline for generating high-quality audio-visual reasoning data, serving both as an evaluation resource and, to the best of our knowledge, as the first training-free reasoning distillation framework for Audio-Visual LLMs (AVLLMs).

  2. Leveraging our proposed reasoning data generation pipeline, we introduce AVReasonBench, a comprehensive AV benchmark featuring 4500 audio-visual samples with detailed step-by-step reasoning solutions across six diverse tasks, encompassing multimodal commonsense reasoning, music comprehension, and humor detection. Additionally, as part of our benchmark, we introduce a novel task, AV-GeoIQ, for geographical understanding, and curate 1000 AV-Compositional and 100 AV-Meme understanding samples through careful manual inspection.

  3. Using our curated reasoning dataset, we achieve up to 100% relative improvement in AVLLM performance through zero-shot reasoning distillation, demonstrating the effectiveness of our approach in enhancing the reasoning capabilities of AV models.

AURELIA Framework

AURELIA is a multi-agent interactive framework whose agents operate in concert to generate reasoning steps that are then distilled into the target model at test time. The input triplet of audio, video, and question is first fed to the Reasoning Generator agent, which produces an initial set of reasoning steps forming a structured pathway to the final answer. A Summarizer agent synthesizes these reasoning steps into a detailed caption. The Evaluator agent then outputs a score measuring how well the caption matches the input audio and video. Based on this score, a feedback mechanism supervises the Reasoning Generator, which adjusts its output to maximize the evaluation score. This actor-critic loop continues until the evaluation score exceeds a fixed threshold or the iteration budget is exhausted.
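For concreteness, the following is a minimal Python sketch of this actor-critic loop. The agent interfaces, method names, and the threshold and iteration defaults are all illustrative assumptions, not the paper's released API:

# Minimal sketch of the AURELIA actor-critic loop (illustrative only).
# The generator, summarizer, evaluator, and target AVLLM interfaces below
# are hypothetical stand-ins, not the authors' released API.

def aurelia_distill(audio, video, question, generator, summarizer,
                    evaluator, target_avllm, threshold=0.8, max_iters=5):
    feedback = None
    best_steps, best_score = None, float("-inf")
    for _ in range(max_iters):
        # Actor: propose structured, step-by-step reasoning for the query.
        steps = generator.generate(audio, video, question, feedback=feedback)
        # Summarizer: condense the reasoning steps into a detailed caption.
        caption = summarizer.summarize(steps)
        # Critic: score how well the caption matches the audio-visual input.
        score = evaluator.score(caption, audio, video)
        if score > best_score:
            best_steps, best_score = steps, score
        if score >= threshold:
            break
        # Feedback steers the generator's next revision of the reasoning.
        feedback = evaluator.feedback(caption, audio, video, score)
    # Test-time distillation: inject the best reasoning into the target
    # model's prompt; no weights are updated.
    prompt = (question + "\n\nReasoning steps:\n"
              + "\n".join(best_steps) + "\n\nAnswer:")
    return target_avllm.answer(audio, video, prompt)

Keeping the best-scoring reasoning across iterations ensures the distilled prompt never degrades if a later revision scores worse than an earlier one.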

AVReasonBench: Audio-Visual Reasoning Benchmark

AVReasonBench comprises 4500 audio-visual questions, each paired with detailed step-by-step reasoning. The benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge.
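To make the data format concrete, a single benchmark entry might be organized along the following lines; all field names, paths, and example content here are our assumptions, not the released schema:

# Hypothetical AVReasonBench-style sample (field names, paths, and content
# are illustrative assumptions, not the released schema).
sample = {
    "task": "AV-GeoIQ",                  # one of the six benchmark tasks
    "video": "clips/example_scene.mp4",  # visual stream
    "audio": "clips/example_scene.wav",  # audio stream
    "question": "Which country is the festival in this clip most likely from?",
    "reasoning_steps": [
        "Identify the instruments audible in the soundtrack.",
        "Identify visual cues such as flags, signage, and decorations.",
        "Combine the audio and visual evidence with geographic knowledge.",
    ],
    "answer": "<ground-truth answer string>",
}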

Quantitative Results

Qualitative Results

BibTeX

@article{chowdhury2025aurelia,
  title={Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs},
  author={Chowdhury, Sanjoy and Gani, Hanan and Anand, Nishit and Nag, Sayan and Gao, Ruohan and Elhoseiny, Mohamed and Khan, Salman and Manocha, Dinesh},
  journal={arXiv preprint arXiv:2503.23219},
  year={2025}
}