
We introduce AVTrustBench, a comprehensive 600K-sample benchmark evaluating AVLLMs on adversarial attacks, compositional reasoning, and modality-specific dependencies. We further propose a new model-agnostic calibrated audio-visual preference optimization-based training strategy, CAVPref, that improves the performance of existing AVLLMs by up to 30.19%.
We present AVTrustBench, a new benchmark comprising three challenging yet unexplored axes, i.e., Adversarial Attack, Compositional Reasoning, and Modality-specific Dependency, and evaluate SOTA Audio-Visual LLMs (AVLLMs) on it. We observe that these models perform poorly under these settings. To alleviate these limitations, we propose a novel AVLLM-agnostic preference optimization strategy, CAVPref, which substantially improves the reliability and robustness of these models over existing solutions such as DPO. [Representative Model: VideoLlama-2.]
With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multimodal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, there are currently no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks and evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in existing approaches, we further propose a robust, model-agnostic calibrated audio-visual preference optimization-based training strategy, CAVPref, obtaining a gain of up to 30.19% across all 9 tasks.
AVTRUSTBENCH statistics and AVLLMs leaderboard. (Left) Task-wise data distribution. Our benchmark comprises 9 diverse tasks spanning over 3 dimensions. (Right) Performance comparison on AVTRUSTBENCH. Values represent dimension-wise averages.
Task definitions: AVTrustBench comprises a total of 9 tasks: MCIT, ICIT, MVIT, and MAIT from Adversarial Attack; COT-Stitch, COT-Swap, and CAT from Compositional Reasoning; and MAT and MVT from Modality-specific Dependency. The goal of each dimension is to critically assess the robustness of existing AVLLMs under a different mode of challenge. In each case, the AVLLMs are presented with a multiple-choice question setup, as sketched below. Refer to Sec. 3.1 in the paper for task-specific details.
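For concreteness, a single sample in this multiple-choice setup might be represented roughly as follows; the field names and values are illustrative assumptions, not the benchmark's released schema.

# Hypothetical layout of one AVTrustBench-style evaluation sample
# (field names and values are illustrative assumptions, not the released schema).
sample = {
    "task": "MCIT",                        # one of the 9 task identifiers
    "video": "clips/00041.mp4",            # visual stream
    "audio": "clips/00041.wav",            # audio stream
    "question": "Which instrument is producing the sound?",
    "options": ["Guitar", "Piano", "Violin", "None of the above"],
    "answer": "Piano",
}

The model receives the audio-visual input together with the question and options, and must select exactly one option.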
Qualitative results: We report the top 8 models' performance on three representative tasks: MCIT, COT-Swap, and MAT. GPT-4o consistently outperforms open-source models. Under the instruction setting, we append the phrase “If the correct answer is not present respond with None of the above” to the prompt, as sketched below. More qualitative results can be found in the supplementary.
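As a rough illustration, the instruction-setting prompt could be assembled as in the sketch below; the helper name and lettered-option formatting are assumptions, and only the appended abstention phrase comes from the description above.

def build_instruction_prompt(question, options):
    """Assemble a multiple-choice prompt for the instruction setting.

    The abstention phrase is appended verbatim; everything else
    (function name, lettered-option layout) is illustrative only.
    """
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        f"{question}\n{lettered}\n"
        "If the correct answer is not present respond with None of the above."
    )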
Overview of CAVPref: We formulate a distributionally robust AV preference optimization objective that incorporates relationships across modalities and counters the long-tail effect across diverse categories in the dataset.
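To give a feel for the kind of objective involved, the sketch below shows a generic DPO-style preference loss with per-sample calibration weights (e.g., up-weighting rare audio-visual categories). It is a minimal, assumed illustration and does not reproduce the exact CAVPref formulation; refer to the paper for the actual objective.

import torch.nn.functional as F

def weighted_preference_loss(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps,
                             calib_weights, beta=0.1):
    """DPO-style preference loss with per-sample calibration weights.

    Each *_logps tensor holds the summed log-probability of the chosen /
    rejected response under the trainable policy or the frozen reference
    model. calib_weights (hypothetical) up-weights under-represented
    audio-visual categories to counter the long-tail effect.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # Weighted negative log-sigmoid of the calibrated margin, averaged over the batch.
    return -(calib_weights * F.logsigmoid(logits)).mean()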
Leaderboards for zero-shot evaluation on 9 different tasks in AVTRUSTBENCH.
Qualitative results on the 9 AVTrustBench tasks: MCIT, ICIT, MVIT, MAIT, MVT, MAT, COT-Stitch, COT-Swap, and CAT.
@article{chowdhury2025avtrustbench,
title={AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs},
author={Chowdhury, Sanjoy and Nag, Sayan and Dasgupta, Subhrajyoti and Wang, Yaoting and Elhoseiny, Mohamed and Gao, Ruohan and Manocha, Dinesh},
journal={arXiv preprint arXiv:2501.02135},
year={2025}
}