Sanjoy Chowdhury

I am a third-year CS PhD student at the University of Maryland, College Park, advised by Prof. Dinesh Manocha. I am broadly interested in multi-modal learning and its applications. My research primarily studies the interplay between the vision and audio modalities and develops systems equipped with a comprehensive understanding of both.

I spent the summer of '24 at Meta Reality Labs as a research scientist intern, hosted by Ruohan Gao. Before that, I was a student researcher at Google Research with Avisek Lahiri and Vivek Kwatra on the Talking Heads team, working on speech-driven facial synthesis. Earlier, I spent a wonderful summer at Adobe Research as a research PhD intern with Joseph K J on the Multi-modal AI team, working on multi-modal audio generation. I am also fortunate to have had the chance to work with Prof. Kristen Grauman and Prof. Mohamed Elhoseiny, among other wonderful mentors and collaborators.

Before starting my PhD, I worked as a Machine Learning Scientist with the Camera and Video AI team at ShareChat, India. I was also a visiting researcher at the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata, under Prof. Ujjwal Bhattacharya. Before that, I was a Senior Research Engineer with the Vision Intelligence Group at Samsung R&D Institute Bangalore, where I primarily developed novel AI-powered solutions for Samsung's smart devices.

I received my MTech in Computer Science & Engineering from IIIT Hyderabad, where I was fortunate to be advised by Prof. C V Jawahar. During my undergrad, I worked as a research intern under Prof. Pabitra Mitra at IIT Kharagpur and at the CVPR Unit at ISI Kolkata.

Email  /  GitHub  /  Google Scholar  /  LinkedIn  /  Twitter

profile photo

Updates

  • Oct 2024 - Invited talk on assessing and addressing the gaps in existing audio-visual LLMs at the AIR Lab, University of Rochester
  • July 2024 - Our work on an audio-visual LLM (Meerkat) accepted to ECCV 2024
  • June 2024 - Invited talk at the Sight and Sound workshop at CVPR 2024
  • May 2024 - Joined Meta Reality Labs as a research scientist intern
  • May 2024 - Paper on improving robustness against spurious correlations accepted to ACL 2024 Findings
  • May 2024 - Our paper on determining perceived audience intent from multi-modal social media posts accepted to Nature Scientific Reports
  • Mar 2024 - Paper on LLM-guided navigational instruction generation accepted to NAACL 2024
  • Feb 2024 - MeLFusion (Highlight, Top 2.8%) accepted to CVPR 2024
  • Feb 2024 - Joined Google Research as a student researcher
  • Oct 2023 - APoLLo accepted to EMNLP 2023
  • Oct 2023 - Invited talk on AdVerb at the AV4D Workshop, ICCV 2023
  • July 2023 - AdVerb accepted to ICCV 2023
  • May 2023 - Joined Adobe Research as a research intern
  • Aug 2022 - Joined the University of Maryland, College Park as a CS PhD student; awarded the Dean's Fellowship
  • Oct 2021 - Paper on audio-visual summarization accepted to BMVC 2021
  • Sep 2021 - Blog on video quality enhancement published at Tech @ ShareChat
  • July 2021 - Paper on reflection removal accepted to ICCV 2021
  • June 2021 - Joined the ShareChat Data Science team
  • May 2021 - Paper on audio-visual joint segmentation accepted to ICIP 2021
  • Dec 2018 - Accepted Samsung Research offer; joining in June '19
  • Sep 2018 - Received the Dean's Merit List Award for academic excellence at IIIT Hyderabad
  • Oct 2017 - Our work on a multi-scale, low-latency face detection framework received the Best Paper Award at NGCT 2017



Selected publications

I am interested in solving computer vision, computer audition, and machine learning problems and applying them to broad AI applications. My research focuses on multi-modal learning (Vision + X) for generative modeling and holistic cross-modal understanding with minimal supervision. Representative papers are highlighted.

Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time


Sanjoy Chowdhury*, Sayan Nag*, Subhrajyoti Dasgupta*, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha
European Conference on Computer Vision (ECCV), 2024
Paper / Project Page (coming soon)

We present Meerkat, an audio-visual LLM equipped with a fine-grained spatial and temporal understanding of images and audio. With a new modality-alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset, AVFIT, comprising 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks.

Towards Determining Perceived Human Intent for Multimodal Social Media Posts Using the Theory of Reasoned Action


Trisha Mittal, Sanjoy Chowdhury, Pooja Guhan, Snikhita Chelluri, Dinesh Manocha
Nature Scientific Reports, 2024
Paper / Dataset

We propose Intent-o-meter, a perceived human intent prediction model for multimodal (image and text) social media posts. Intent-o-meter incorporates ideas from the psychology and cognitive modeling literature, in addition to visual and textual features, for improved perceived-intent prediction.

Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis


Vishnu Sashank Dorbala, Sanjoy Chowdhury, Dinesh Manocha
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Paper

We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that rely heavily on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions from just a few references.

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models (Highlight, Top 2.8%)


Sanjoy Chowdhury*, Sayan Nag*, Joseph KJ, Balaji Vasan Srinivasan, Dinesh Manocha
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Paper / Project Page / Poster / Video / Dataset / Code

We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse" that effectively infuses semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM.

APoLLo: Unified Adapter and Prompt Learning for Vision Language Models


Sanjoy Chowdhury*, Sayan Nag*, Dinesh Manocha
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Paper / Project Page / Poster / Video / Code

Our method is designed to substantially improve the generalization capabilities of vision-language pre-trained (VLP) models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with the vision and language encoders to strengthen the alignment between the two modalities.

AdVerb: Visually Guided Audio Dereverberation


Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2023
Paper / Project Page / Video / Poster / Code

We present a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio.

Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation


Jiaye Wu, Sanjoy Chowdhury, Hariharmano Shanmugaraja, David Jacobs, Soumyadip Sengupta
International Conference on Computational Photography (ICCP), 2023
Paper / Project Page / Dataset

To comprehensively evaluate albedo, we collect a new dataset, Measured Albedo in the Wild (MAW), and propose three new metrics that complement WHDR (Weighted Human Disagreement Rate).

AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation


Sanjoy Chowdhury*, Aditya P. Patra*, Subhrajyoti Dasgupta, Ujjwal Bhattacharya
British Machine Vision Conference (BMVC), 2021
Paper / Code / Presentation

We introduce a novel deep reinforcement learning-based, self-supervised audio-visual summarization model that leverages both audio and visual information to generate diverse yet semantically meaningful summaries.

V-DESIRR: Very Fast Deep Embedded Single Image Reflection Removal


B H Pawan Prasad, Green Rosh K S, Lokesh R B, Kaushik Mitra, Sanjoy Chowdhury
International Conference on Computer Vision (ICCV), 2021
Paper / Code

We propose a multi-scale, end-to-end architecture for detecting and removing weak, medium, and strong reflections from naturally occurring images.

Listen to the Pixels


Sanjoy Chowdhury, Subhrajyoti Dasgupta, Sudip Das, Ujjwal Bhattacharya
International Conference on Image Processing (ICIP), 2021
Paper / Code / Presentation

In this study, we exploit the concurrency between the audio and visual modalities to solve the joint audio-visual segmentation problem in a self-supervised manner.




Blog(s)

I have tried my hand at writing technical blogs.

The devil is in the details: Video Quality Enhancement Approaches


Link

The blog contextualizes the problem of video enhancement in present-day scenarios and discusses a couple of interesting approaches to this challenging task.

Academic services

I have served as a reviewer for the following conferences:

CVPR: 2023, '24, '25
ICCV: 2023
ECCV: 2024
NeurIPS: 2024
AAAI: 2025
WACV: 2022, '23, '24
ACMMM: 2023, '24
ACL: 2024




Affiliations




IIT Kharagpur
Apr - Sep 2016

ISI Kolkata
Feb - July 2017

IIIT Hyderabad
Aug 2017 - May 2019

Mentor Graphics Hyderabad
May - July 2018

Samsung Research Bangalore
June 2019 - June 2021

ShareChat Bangalore
June 2021 - May 2022

UMD College Park
Aug 2022 - Present

Adobe Research
May 2023 - Aug 2023

KAUST
Jan 2024 - Present

Google Research
Feb 2024 - May 2024

Meta AI
May 2024 - Nov 2024

Template credits: Jon Barron; thanks to Richa for making this.