Sanjoy Chowdhury

I am a second-year CS PhD student at the University of Maryland, College Park, advised by Prof. Dinesh Manocha. I am broadly interested in multi-modal learning and its applications. My current research studies the interplay between the visual and audio modalities, with the goal of learning a holistic understanding of both in real-world settings.

I am currently working as a research scientist intern at Meta Reality Labs. Before this, I was a student researcher at Google Research, working with Avisek Lahiri and Vivek Kwatra in the Talking heads team on speech-driven facial synthesis. Previously, I spent a wonderful summer at Adobe Research as a research PhD intern, working with Joseph K J in the Multi-modal AI team on multi-modal audio generation. I am also fortunate to have had the chance to work with Prof. Kristen Grauman, Prof. Mohamed Elhoseiny, and Ruohan Gao, among other wonderful collaborators.

Before starting my PhD, I was a Machine Learning Scientist with the Camera and Video AI team at ShareChat, India. I was also a visiting researcher at the Computer Vision and Pattern Recognition Unit at the Indian Statistical Institute, Kolkata, under Prof. Ujjwal Bhattacharya. Before that, I was a Senior Research Engineer with the Vision Intelligence Group at Samsung R&D Institute Bangalore, where I primarily worked on developing AI-powered solutions for Samsung's smart devices.

I received my MTech in Computer Science & Engineering from IIIT Hyderabad, where I was fortunate to be advised by Prof. C V Jawahar. During my undergrad, I worked as a research intern under Prof. Pabitra Mitra at IIT Kharagpur and at the CVPR Unit at ISI Kolkata.

Email  /  GitHub  /  Google Scholar  /  LinkedIn  /  Twitter



  • May 2024 - Joined Meta Reality Labs as a Research Scientist intern.
  • May 2024 - Our paper on determining perceived audience intent from multi-modal social media posts got accepted to Nature Scientific Reports
  • Mar 2024 - Paper on LLM guided navigational instruction generation got accepted to NAACL 2024
  • Feb 2024 - MeLFusion (Highlight, Top 2.8%) got accepted to CVPR 2024
  • Feb 2024 - Joined Google Research as a student researcher.
  • Oct 2023 - APoLLo got accepted to EMNLP 2023
  • Oct 2023 - Invited talk on AdVerb at AV4D Workshop, ICCV 2023
  • July 2023 - AdVerb got accepted to ICCV 2023
  • May 2023 - Joined Adobe Research as a research intern.
  • Aug 2022 - Joined the University of Maryland, College Park as a CS PhD student. Awarded the Dean's Fellowship.
  • Oct 2021 - Paper on audio-visual summarization accepted in BMVC 2021.
  • Sep 2021 - Blog on Video Quality Enhancement released at Tech @ ShareChat.
  • July 2021 - Paper on reflection removal got accepted in ICCV 2021.
  • June 2021 - Joined ShareChat Data Science team.
  • May 2021 - Paper on audio-visual joint segmentation accepted in ICIP 2021.
  • Dec 2018 - Accepted Samsung Research offer. Will be joining in June'19.
  • Sep 2018 - Received Dean's Merit List Award for academic excellence at IIIT Hyderabad.
  • Oct 2017 - Our work on a multi-scale, low-latency face detection framework received Best Paper Award at NGCT-2017.

Selected publications

My research is at the intersection of computer vision and deep learning, with a focus on multi-modal learning (Vision + X), generative modeling, visual understanding, and their applications. I'm broadly interested in studying the interplay between different modalities with minimal supervision.

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models (Highlight, Top 2.8%)

Sanjoy Chowdhury*, Sayan Nag*, Joseph KJ, Balaji Vasan Srinivasan, Dinesh Manocha
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Paper / Project Page / Poster / Video / Dataset / Code

We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM.

APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

Sanjoy Chowdhury*, Sayan Nag*, Dinesh Manocha
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Paper / Project Page / Poster / Video / Code

Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities.

AdVerb: Visually Guided Audio Dereverberation

Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2023
Paper / Project Page / Video / Poster / Code

We present a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio.

Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation

Jiaye Wu, Sanjoy Chowdhury, Hariharmano Shanmugaraja, David Jacobs, Soumyadip Sengupta
International Conference on Computational Photography (ICCP), 2023
Paper / Project Page / Dataset (coming soon)

To evaluate albedo comprehensively, we collect a new dataset, Measured Albedo in the Wild (MAW), and propose three new metrics that complement WHDR.

AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation

Sanjoy Chowdhury*, Aditya P. Patra*, Subhrajyoti Dasgupta, Ujjwal Bhattacharya
British Machine Vision Conference (BMVC), 2021
Paper / Code / Presentation

We introduce a novel self-supervised audio-visual summarization model based on deep reinforcement learning that leverages both audio and visual information to generate diverse yet semantically meaningful summaries.

V-DESIRR: Very Fast Deep Embedded Single Image Reflection Removal

B H Pawan Prasad, Green Rosh K S, Lokesh R B, Kaushik Mitra, Sanjoy Chowdhury
International Conference on Computer Vision (ICCV), 2021
Paper / Code

We propose a multi-scale end-to-end architecture for detecting and removing weak, medium, and strong reflections from natural images.

Listen to the Pixels

Sanjoy Chowdhury, Subhrajyoti Dasgupta, Sudip Das, Ujjwal Bhattacharya
International Conference on Image Processing (ICIP), 2021
Paper / Code / Presentation

In this study, we exploit the concurrency between the audio and visual modalities to solve the joint audio-visual segmentation problem in a self-supervised manner.


I have also tried my hand at writing technical blogs.

The devil is in the details: Video Quality Enhancement Approaches


The blog puts the problem of video enhancement in the context of present-day use and discusses a couple of interesting approaches to this challenging task.

Academic services

I have served as a reviewer for the following conferences:

CVPR: 2023, '24

ICCV: 2023

ECCV: 2024

NeurIPS: 2024

WACV: 2022, '23, '24

ACMMM: 2023, '24


IIT Kharagpur
Apr-Sep 2016

ISI Kolkata
Feb-July 2017

IIIT Hyderabad
Aug 2017 - May 2019

Mentor Graphics Hyderabad
May - July 2018

Samsung Research Bangalore
June 2019 - June 2021

ShareChat Bangalore
June 2021 - May 2022

UMD College Park
Aug 2022 - Present

Adobe Research
May 2023 - Aug 2023

Jan 2024 - Present

Google Research
Feb 2024 - May 2024

Meta AI
May 2024 - Aug 2024

Template credits: Jon Barron and thanks to Richa for making this.