
We introduce AVTrustBench, a comprehensive 600K-sample benchmark evaluating AVLLMs on adversarial attacks, compositional reasoning, and modality-specific dependencies. We further propose a new model-agnostic calibrated audio-visual preference optimization-based training strategy, CAVPref, that improves the performance of existing AVLLMs by up to 30.19%.
We present AVTrustBench, a new benchmark comprising three challenging yet unexplored axes, i.e., Adversarial Attack, Compositional Reasoning, and Modality-specific Dependency, and evaluate SOTA Audio-Visual LLMs (AVLLMs) on it. We observe that these models perform poorly under these settings. To alleviate these limitations, we propose a novel AVLLM-agnostic preference optimization strategy, CAVPref, which substantially improves the reliability and robustness of these models over existing solutions such as DPO. [Representative Model: VideoLlama-2.]
With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multimodal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, there are currently no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks and evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in existing approaches, we further propose a robust, model-agnostic calibrated audio-visual preference optimization-based training strategy, CAVPref, obtaining a gain of up to 30.19% across all 9 tasks.
AVTRUSTBENCH statistics and AVLLMs leaderboard. (Left) Task-wise data distribution. Our benchmark comprises 9 diverse tasks spanning over 3 dimensions. (Right) Performance comparison on AVTRUSTBENCH. Values represent dimension-wise averages.
Task definitions: AVTrustBench comprises a total of 9 tasks: MCIT, ICIT, MVIT, and MAIT from Adversarial Attack; COT-Stitch, COT-Swap, and CAT from Compositional Reasoning; and MAT and MVT from Modality-specific Dependency. The goal of each dimension is to critically assess the robustness of existing AVLLMs under a different mode of challenge. In each case, the AVLLMs are presented with a multiple-choice question setup, as sketched below. Refer to Sec. 3.1 in the paper for task-specific details.
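For concreteness, a single sample in this multiple-choice setup might be represented roughly as follows; the field names and values are illustrative assumptions, not the benchmark's released schema.

# Hypothetical layout of one AVTrustBench-style evaluation sample
# (field names and values are illustrative assumptions, not the released schema).
sample = {
    "task": "MCIT",                        # one of the 9 task identifiers
    "video": "clips/00041.mp4",            # visual stream
    "audio": "clips/00041.wav",            # audio stream
    "question": "Which instrument is producing the sound?",
    "options": ["Guitar", "Piano", "Violin", "None of the above"],
    "answer": "Piano",
}

The model receives the audio-visual input together with the question and options, and must select exactly one option.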
Qualitative results: We report the top 8 models' performance on three representative tasks: MCIT, COT-Swap, and MAT. GPT-4o consistently outperforms open-source models. Under the instruction setting, we append the phrase “If the correct answer is not present respond with None of the above” to the prompt, as sketched below. More qualitative results can be found in the supplementary.
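As a rough illustration, the instruction-setting prompt could be assembled as in the sketch below; the helper name and lettered-option formatting are assumptions, and only the appended abstention phrase comes from the description above.

def build_instruction_prompt(question, options):
    """Assemble a multiple-choice prompt for the instruction setting.

    The abstention phrase is appended verbatim; everything else
    (function name, lettered-option layout) is illustrative only.
    """
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{lettered}\n"
        "If the correct answer is not present respond with None of the above."
    )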
Overview of CAVPref: We formulate a distributionally robust AV preference optimization objective that incorporates relationships across modalities and counters the long-tail effect across diverse categories in the dataset.
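To give a feel for the kind of objective involved, the sketch below shows a generic DPO-style preference loss with per-sample calibration weights (e.g., up-weighting rare audio-visual categories). It is a minimal, assumed illustration and does not reproduce the exact CAVPref formulation; refer to the paper for the actual objective.

import torch.nn.functional as F

def weighted_preference_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps,
                             calib_weights, beta=0.1):
    """DPO-style preference loss with per-sample calibration weights.

    Each *_logps tensor holds the summed log-probability of the chosen /
    rejected response under the trainable policy or the frozen reference
    model. calib_weights (hypothetical) up-weights under-represented
    audio-visual categories to counter the long-tail effect.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # Weighted negative log-sigmoid of the calibrated margin, averaged over the batch.
    return -(calib_weights * F.logsigmoid(logits)).mean()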
Leaderboards for zero-shot evaluation on 9 different tasks in AVTRUSTBENCH.
Qualitative results on the 9 AVTrustBench tasks: MCIT, ICIT, MVIT, MAIT, MVT, MAT, COT-Stitch, COT-Swap, and CAT.
@article{chowdhury2025avtrustbench,
title={AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs},
author={Chowdhury, Sanjoy and Nag, Sayan and Dasgupta, Subhrajyoti and Wang, Yaoting and Elhoseiny, Mohamed and Gao, Ruohan and Manocha, Dinesh},
journal={arXiv preprint arXiv:2501.02135},
year={2025}
}