Selected publications

I am interested in solving computer vision, computer audition, and machine learning problems and applying them to broad AI applications. My research focuses on applying multi-modal learning (Vision + X) to generative modeling and holistic cross-modal understanding with minimal supervision. Representative papers are highlighted.
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury*, Sayan Nag*, Subhrajyoti Dasgupta*, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
European Conference on Computer Vision (ECCV), 2024
Paper / Project Page / Poster / Video / Dataset / Code
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio, both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate AVFIT, a large dataset comprising 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks.
ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations
Sreyan Ghosh*, Chandra Kiran Reddy Evuru*, Sonal Kumar, Utkarsh Tyagi, Sakshi Singh, Sanjoy Chowdhury, Dinesh Manocha
ACL Findings, 2024
Paper / Code
The paper proposes a simple yet effective solution for supplementing the training dataset with images without spurious features, enabling robust learning against spurious correlations via better generalization.
Towards Determining Perceived Human Intent for Multimodal Social Media Posts using The Theory of Reasoned Action
Trisha Mittal, Sanjoy Chowdhury, Pooja Guhan, Snikhita Chelluri, Dinesh Manocha
Nature Scientific Reports
Paper / Dataset
We propose Intent-o-meter, a perceived human intent prediction model for multimodal (image and text) social media posts. Intent-o-meter models ideas from psychology and cognitive modeling literature, in addition to using the visual and textual features, for an improved perceived intent prediction.
Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis
Vishnu Sashank Dorbala, Sanjoy Chowdhury, Dinesh Manocha
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Paper
We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references.
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models (Highlight, Top 2.8%)
Sanjoy Chowdhury*, Sayan Nag*, Joseph KJ, Balaji Vasan Srinivasan, Dinesh Manocha
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Paper / Project Page / Poster / Video / Dataset / Code
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM.
APoLLo: Unified Adapter and Prompt Learning for Vision Language Models
AdVerb: Visually Guided Audio Dereverberation
Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2023
Paper / Project Page / Video / Poster / Code
We present a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio.
Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation
Jiaye Wu, Sanjoy Chowdhury, Hariharmano Shanmugaraja, David Jacobs, Soumyadip Sengupta
International Conference on Computational Photography (ICCP), 2023
Paper / Project Page / Dataset
In order to comprehensively evaluate albedo, we collect a new dataset, Measured Albedo in the Wild (MAW), and propose three new metrics that complement WHDR.
AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation
Sanjoy Chowdhury*, Aditya P. Patra*, Subhrajyoti Dasgupta, Ujjwal Bhattacharya
British Machine Vision Conference (BMVC), 2021
Paper / Code / Presentation
We introduce a novel deep reinforcement learning-based self-supervised audio-visual summarization model that leverages both audio and visual information to generate diverse yet semantically meaningful summaries.
V-DESIRR: Very Fast Deep Embedded Single Image Reflection Removal
B H Pawan Prasad, Green Rosh K S, Lokesh R B, Kaushik Mitra, Sanjoy Chowdhury
International Conference on Computer Vision (ICCV), 2021
Paper / Code
We propose a multi-scale end-to-end architecture for detecting and removing weak, medium, and strong reflections from naturally occurring images.
Listen to the Pixels
Sanjoy Chowdhury, Subhrajyoti Dasgupta, Sudip Das, Ujjwal Bhattacharya
International Conference on Image Processing (ICIP), 2021
Paper / Code / Presentation
In this study, we exploit the concurrency between audio and visual modalities in an attempt to solve the joint audio-visual segmentation problem in a self-supervised manner.