Selected publications

I am interested in computer vision, computer audition, and machine learning, and in applying them to a broad range of AI applications. My research focuses on multi-modal learning (Vision + X) for generative modeling and holistic cross-modal understanding with minimal supervision. Representative papers are highlighted.

AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs


Sanjoy Chowdhury*, Hanan Gani*, Nishit Anand, Sayan Nag, Ruohan Gao, Mohamed Elhoseiny, Salman Khan, Dinesh Manocha

arXiv Preprint / Project Page and Dataset (Coming soon!)

We introduce AURELIA, a novel actor-critic-based audio-visual reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. The benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge.
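
For intuition only, here is a hypothetical sketch of what test-time reasoning distillation with an actor-critic pair could look like; the actor, critic, and avllm callables are placeholders assumed for illustration, not AURELIA's actual interface.

    def test_time_reasoning(avllm, actor, critic, video, audio, question,
                            n_candidates=4, threshold=0.8):
        """Hypothetical actor-critic loop: the actor drafts step-by-step reasoning,
        the critic scores each draft, and the best trace is folded into the prompt."""
        best_trace, best_score = None, float("-inf")
        for _ in range(n_candidates):
            trace = actor(video, audio, question)          # draft a reasoning chain
            score = critic(trace, video, audio, question)  # rate its quality/faithfulness
            if score > best_score:
                best_trace, best_score = trace, score
            if best_score >= threshold:                    # early stop on a good trace
                break
        prompt = f"{question}\nReasoning steps:\n{best_trace}\nAnswer:"
        return avllm(video, audio, prompt)                 # no weight updates at test time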

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs


Sanjoy Chowdhury*, Sayan Nag*, Subhrajyoti Dasgupta, Yaoting Wang, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

arXiv Preprint / Project Page and Dataset (Coming soon!)

We introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks that evaluate the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 16 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of human-like comprehension, offering valuable insights for future research. To alleviate these limitations, we further propose CAVPref, a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, which yields gains of up to 30.19% across all 9 tasks.
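
CAVPref's exact calibrated objective is defined in the paper; as a rough, assumed illustration of the family of losses such a preference-optimization strategy builds on, a generic DPO-style loss over preferred versus dispreferred audio-visual responses looks like this (PyTorch).

    import torch.nn.functional as F

    def preference_loss(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
        """Generic DPO-style preference loss (illustrative, not CAVPref itself).

        Each argument is a (B,) tensor of summed log-probabilities of a response
        under the trained model ("policy") or a frozen reference model.
        """
        # Implicit rewards from log-probability ratios against the reference model.
        chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
        rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
        # Push the preferred (e.g., audio-visually grounded) response to win the margin.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()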

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time


Sanjoy Chowdhury*, Sayan Nag*, Subhrajyoti Dasgupta*, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
European Conference on Computer Vision (ECCV), 2024
Paper / Project Page / Poster / Video / Dataset / Code

We present Meerkat, an audio-visual LLM equipped with a fine-grained spatial and temporal understanding of images and audio. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate AVFIT, a large dataset comprising 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks.
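
The paper defines its own optimal-transport alignment module; purely as an assumed sketch of the underlying idea, a Sinkhorn-style transport plan between (L2-normalized) audio and visual token embeddings can be computed as follows.

    import torch

    def sinkhorn_alignment(audio_tokens, visual_tokens, eps=0.05, iters=50):
        """Illustrative optimal-transport alignment between modalities.

        audio_tokens:  (Na, d) L2-normalized audio token embeddings
        visual_tokens: (Nv, d) L2-normalized visual token embeddings
        Returns the transport plan T of shape (Na, Nv) and the alignment cost.
        """
        cost = 1.0 - audio_tokens @ visual_tokens.T           # cosine-distance cost matrix
        K = torch.exp(-cost / eps)                            # entropic (Gibbs) kernel
        a = torch.full((cost.size(0),), 1.0 / cost.size(0))   # uniform marginal over audio tokens
        b = torch.full((cost.size(1),), 1.0 / cost.size(1))   # uniform marginal over visual tokens
        u = torch.ones_like(a)
        for _ in range(iters):                                # Sinkhorn iterations
            v = b / (K.T @ u)
            u = a / (K @ v)
        T = torch.diag(u) @ K @ torch.diag(v)                 # soft token-to-token matching
        return T, (T * cost).sum()                            # plan and alignment cost

Minimizing the returned cost encourages audio tokens to map onto semantically matching visual tokens, which is the spirit (though not the specifics) of the alignment objective described above.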

ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations


Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar, Utkarsh Tyagi, Sakshi Singh, Sanjoy Chowdhury, Dinesh Manocha
ACL Findings 2024
Paper / Code

ASPIRE proposes a simple yet effective language-guided solution for supplementing the training dataset with images that lack spurious features, enabling robust learning against spurious correlations through better generalization.

Towards Determining Perceived Human Intent for Multimodal Social Media Posts Using the Theory of Reasoned Action


Trisha Mittal, Sanjoy Chowdhury, Pooja Guhan, Snikhita Chelluri, Dinesh Manocha
Scientific Reports (Nature)
Paper / Dataset

We propose Intent-o-meter, a perceived human intent prediction model for multimodal (image and text) social media posts. Intent-o-meter draws on ideas from the psychology and cognitive modeling literature, in addition to visual and textual features, for improved perceived-intent prediction.

Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis


Vishnu Sashank Dorbala, Sanjoy Chowdhury, Dinesh Manocha
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Paper

We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that rely heavily on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references.

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models (Highlight, Top 2.8%)


Sanjoy Chowdhury*, Sayan Nag*, Joseph KJ, Balaji Vasan Srinivasan, Dinesh Manocha
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Paper / Project Page / Poster / Video / Dataset / Code

We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse" that infuses semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM.
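
The "visual synapse" is the paper's own mechanism; as a generic, assumed stand-in, injecting image semantics into the music latents of a diffusion denoiser via cross-attention could look like the sketch below (the module name, dimensions, and frozen-image-encoder assumption are illustrative).

    import torch.nn as nn

    class VisualConditioning(nn.Module):
        """Generic cross-attention block that injects image features into the
        latent sequence of a diffusion denoiser (illustrative only)."""

        def __init__(self, latent_dim=512, visual_dim=768, heads=8):
            super().__init__()
            self.proj = nn.Linear(visual_dim, latent_dim)   # map image features to latent width
            self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(latent_dim)

        def forward(self, music_latents, image_features):
            # music_latents:  (B, T, latent_dim) noisy music latents at the current step
            # image_features: (B, N, visual_dim) patch features from a frozen image encoder
            ctx = self.proj(image_features)
            attended, _ = self.attn(self.norm(music_latents), ctx, ctx)
            return music_latents + attended                 # residual injection of visual semantics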

APoLLo: Unified Adapter and Prompt Learning for Vision Language Models


Sanjoy Chowdhury*, Sayan Nag*, Dinesh Manocha
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Paper / Project Page / Poster / Video / Code

Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities.
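
As a sketch under assumptions (generic dimensions, ReLU bottleneck; not APoLLo's exact design), a trainable cross-attention adapter that lets one modality's encoder layer attend to the other's tokens might look like this.

    import torch
    import torch.nn as nn

    class CrossModalAdapter(nn.Module):
        """Illustrative bottleneck adapter with cross-attention between the
        text and image streams; a generic stand-in, not APoLLo itself."""

        def __init__(self, dim=512, bottleneck=64, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.down = nn.Linear(dim, bottleneck)    # adapter down-projection
            self.up = nn.Linear(bottleneck, dim)      # adapter up-projection
            self.norm = nn.LayerNorm(dim)

        def forward(self, own_tokens, other_tokens):
            # own_tokens:   (B, N, dim) tokens from this modality's encoder layer
            # other_tokens: (B, M, dim) tokens from the other modality's encoder layer
            attended, _ = self.attn(self.norm(own_tokens), other_tokens, other_tokens)
            return own_tokens + self.up(torch.relu(self.down(attended)))  # residual update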

AdVerb: Visually Guided Audio Dereverberation


Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2023
Paper / Project Page / Video / Poster / Code

We present a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio.

Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation


Jiaye Wu, Sanjoy Chowdhury, Hariharmano Shanmugaraja, David Jacobs, Soumyadip Sengupta
International Conference on Computational Photography (ICCP), 2023
Paper / Project Page / Dataset

To comprehensively evaluate albedo, we collect a new dataset, Measured Albedo in the Wild (MAW), and propose three new metrics that complement WHDR.

AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation


Sanjoy Chowdhury*, Aditya P. Patra*, Subhrajyoti Dasgupta, Ujjwal Bhattacharya
British Machine Vision Conference (BMVC), 2021
Paper / Code / Presentation

We introduce a novel self-supervised, deep reinforcement learning-based audio-visual summarization model that leverages both audio and visual information to generate diverse yet semantically meaningful summaries.

V-DESIRR: Very Fast Deep Embedded Single Image Reflection Removal


B H Pawan Prasad, Green Rosh K S, Lokesh R B, Kaushik Mitra, Sanjoy Chowdhury
International Conference on Computer Vision (ICCV), 2021
Paper / Code

We propose a multi-scale, end-to-end architecture for detecting and removing weak, medium, and strong reflections from naturally occurring images.

Listen to the Pixels


Sanjoy Chowdhury, Subhrajyoti Dasgupta, Sudip Das, Ujjwal Bhattacharya
International Conference on Image Processing (ICIP), 2021
Paper / Code / Presentation

We exploit the concurrency between the audio and visual modalities to solve the joint audio-visual segmentation problem in a self-supervised manner.