Sanjoy Chowdhury
I am a third-year CS PhD student at the University of Maryland, College Park, advised by Prof. Dinesh Manocha. I am broadly interested in multi-modal learning and its applications. My research primarily studies the interplay between the vision and audio modalities and develops systems equipped with a comprehensive understanding of both.
I am currently a research scientist intern at Meta Reality Labs. Before this, I was a student researcher at Google Research with Avisek Lahiri and Vivek Kwatra on the Talking Heads team, working on speech-driven facial synthesis. Previously, I spent a wonderful summer as a research PhD intern at Adobe Research, working with Joseph K J on the Multi-modal AI team on multi-modal audio generation. I am also fortunate to have had the chance to work with Prof. Kristen Grauman, Prof. Mohamed Elhoseiny, and Ruohan Gao, among other wonderful mentors and collaborators.
Before starting my PhD, I worked as a Machine Learning Scientist with the Camera and Video AI team at ShareChat, India. I was also a visiting researcher at the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata, under Prof. Ujjwal Bhattacharya. Before that, I was a Senior Research Engineer with the Vision Intelligence Group at Samsung R&D Institute Bangalore, where I primarily developed novel AI-powered solutions for Samsung's smart devices.
I received my MTech in Computer Science & Engineering from IIIT Hyderabad, where I was fortunate to be advised by Prof. C V Jawahar. During my undergrad, I worked as a research intern under Prof. Pabitra Mitra at IIT Kharagpur and at the CVPR Unit of ISI Kolkata.
Email / GitHub / Google Scholar / LinkedIn / Twitter
Updates
- July 2024 - Work on our audio-visual LLM accepted to ECCV 2024.
- June 2024 - Invited talk at the Sight and Sound workshop at CVPR 2024.
- May 2024 - Joined Meta Reality Labs as a Research Scientist intern.
- May 2024 - Paper on improving robustness against spurious correlations accepted to ACL 2024 Findings.
- May 2024 - Our paper on determining perceived audience intent from multi-modal social media posts accepted to Nature Scientific Reports.
- Mar 2024 - Paper on LLM-guided navigational instruction generation accepted to NAACL 2024.
- Feb 2024 - MeLFusion (Highlight, top 2.8%) accepted to CVPR 2024.
- Feb 2024 - Joined Google Research as a student researcher.
- Oct 2023 - APoLLo accepted to EMNLP 2023.
- Oct 2023 - Invited talk on AdVerb at the AV4D Workshop, ICCV 2023.
- July 2023 - AdVerb accepted to ICCV 2023.
- May 2023 - Joined Adobe Research as a research intern.
- Aug 2022 - Joined the University of Maryland, College Park as a CS PhD student. Awarded the Dean's Fellowship.
- Oct 2021 - Paper on audio-visual summarization accepted to BMVC 2021.
- Sep 2021 - Blog on video quality enhancement released at Tech @ ShareChat.
- July 2021 - Paper on reflection removal accepted to ICCV 2021.
- June 2021 - Joined the ShareChat Data Science team.
- May 2021 - Paper on audio-visual joint segmentation accepted to ICIP 2021.
- Dec 2018 - Accepted Samsung Research offer; joining in June 2019.
- Sep 2018 - Received the Dean's Merit List Award for academic excellence at IIIT Hyderabad.
- Oct 2017 - Our work on a multi-scale, low-latency face detection framework received the Best Paper Award at NGCT 2017.
Selected publications
I am interested in solving computer vision, computer audition, and machine learning problems, and in applying them to a broad range of AI applications. My research focuses on multi-modal learning (Vision + X) for generative modeling and holistic cross-modal understanding with minimal supervision. Representative papers are highlighted.
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
Sanjoy Chowdhury*, Sayan Nag*, Subhrajyoti Dasgupta*, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
European Conference on Computer Vision (ECCV), 2024
Paper / Project Page (coming soon)
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio, both spatially and temporally. With a new modality-alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate AVFIT, a large dataset comprising 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks.
Towards Determining Perceived Human Intent for Multimodal Social Media Posts using The Theory of Reasoned Action
Trisha Mittal, Sanjoy Chowdhury, Pooja Guhan, Snikhita Chelluri, Dinesh Manocha
Nature Scientific Reports
Paper / Dataset
We propose Intent-o-meter, a model for predicting the perceived human intent of multimodal (image and text) social media posts. Intent-o-meter draws on ideas from the psychology and cognitive modeling literature, in addition to using visual and textual features, for improved perceived-intent prediction.
Can LLM’s Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis
Vishnu Sashank Dorbala, Sanjoy Chowdhury, Dinesh Manocha
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Paper
We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that rely heavily on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions from just a few references.
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models (Highlight, Top 2.8%)
Sanjoy Chowdhury*, Sayan Nag*, Joseph KJ, Balaji Vasan Srinivasan, Dinesh Manocha
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Paper / Project Page / Poster / Video / Dataset / Code
We propose MeLFusion, a model that effectively uses cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse" that effectively infuses semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM.
APoLLo: Unified Adapter and Prompt Learning for Vision Language Models
Sanjoy Chowdhury*, Sayan Nag*, Dinesh Manocha
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Paper / Project Page / Poster / Video / Code
Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities.
AdVerb: Visually Guided Audio Dereverberation
Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2023
Paper / Project Page / Video / Poster / Code
We present a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio.
Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation
Jiaye Wu, Sanjoy Chowdhury, Hariharmano Shanmugaraja, David Jacobs, Soumyadip Sengupta
International Conference on Computational Photography (ICCP), 2023
Paper / Project Page / Dataset
To comprehensively evaluate albedo, we collect a new dataset, Measured Albedo in the Wild (MAW), and propose three new metrics that complement WHDR.
AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation
Sanjoy Chowdhury*, Aditya P. Patra*, Subhrajyoti Dasgupta, Ujjwal Bhattacharya
British Machine Vision Conference (BMVC), 2021
Paper / Code / Presentation
We introduce a novel deep reinforcement learning-based, self-supervised audio-visual summarization model that leverages both audio and visual information to generate diverse yet semantically meaningful summaries.
V-DESIRR: Very Fast Deep Embedded Single Image Reflection Removal
B H Pawan Prasad, Green Rosh K S, Lokesh R B, Kaushik Mitra, Sanjoy Chowdhury
International Conference on Computer Vision (ICCV), 2021
Paper / Code
We propose a multi-scale, end-to-end architecture for detecting and removing weak, medium, and strong reflections from natural images.
Listen to the Pixels
Sanjoy Chowdhury, Subhrajyoti Dasgupta, Sudip Das, Ujjwal Bhattacharya
International Conference on Image Processing (ICIP), 2021
Paper / Code / Presentation
We exploit the concurrency between the audio and visual modalities to solve the joint audio-visual segmentation problem in a self-supervised manner.
Blog(s)
I have tried my hand at writing technical blogs.
The devil is in the details: Video Quality Enhancement Approaches
Link
The blog contextualizes the problem of video enhancement in present-day scenarios and discusses a couple of interesting approaches to this challenging task.
Academic services
I have served as a reviewer for the following conferences:
CVPR: 2023, '24
ICCV: 2023
ECCV: 2024
NeurIPS: 2024
WACV: 2022, '23, '24
ACMMM: 2023, '24
ACL: 2024
Affiliations
IIT Kharagpur, Apr-Sep 2016
ISI Kolkata, Feb-Jul 2017
IIIT Hyderabad, Aug 2017 - May 2019
Mentor Graphics Hyderabad, May-Jul 2018
Samsung Research Bangalore, Jun 2019 - Jun 2021
ShareChat Bangalore, Jun 2021 - May 2022
UMD College Park, Aug 2022 - Present
Adobe Research, May 2023 - Aug 2023
KAUST, Jan 2024 - Present
Google Research, Feb 2024 - May 2024
Meta AI, May 2024 - Nov 2024