Sanjoy Chowdhury
I am a second-year CS PhD student at the University of Maryland, College Park, advised by Prof. Dinesh Manocha. I am broadly interested in multi-modal learning and its applications. My current line of research studies the interplay between the vision and audio modalities, toward a holistic understanding of both in real-world settings.
I am currently a student researcher at Google Research, working with Avisek Lahiri and Vivek Kwatra on the Talking Heads team on speech-driven facial synthesis. Before Google, I spent a wonderful summer at Adobe Research as a research PhD intern, working with Joseph K J on the Multi-modal AI team on multi-modal audio generation. I am also fortunate to have had the chance to work with Prof. Kristen Grauman, Prof. Mohamed Elhoseiny, and Ruohan Gao, among other wonderful collaborators.
Before this, I was a Machine Learning Scientist with the Camera and Video AI team at ShareChat, India, and a visiting researcher at the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata, under Prof. Ujjwal Bhattacharya. Before that, I was a Senior Research Engineer with the Vision Intelligence Group at Samsung R&D Institute Bangalore, where I primarily developed novel AI-powered solutions for Samsung's smart devices.
I received my MTech in Computer Science & Engineering from IIIT Hyderabad, where I was fortunate to be advised by Prof. C V Jawahar. During my undergrad, I worked as a research intern under Prof. Pabitra Mitra at IIT Kharagpur and at the CVPR Unit at ISI Kolkata.
Email /
GitHub /
Google Scholar /
LinkedIn /
Twitter
Updates
[Mar 2024] Paper on LLM-guided navigational instruction generation accepted to NAACL 2024.
[Feb 2024] MeLFusion accepted to CVPR 2024.
[Feb 2024] Joined Google Research as a student researcher.
[Oct 2023] APoLLo accepted to EMNLP 2023.
[Oct 2023] Invited talk on AdVerb at the AV4D Workshop, ICCV 2023.
[July 2023] AdVerb accepted to ICCV 2023.
[May 2023] Joined Adobe Research as a research intern.
[Aug 2022] Joined the University of Maryland, College Park as a CS PhD student. Awarded the Dean's Fellowship.
[Oct 2021] Paper on audio-visual summarization accepted to BMVC 2021.
[Sep 2021] Blog on video quality enhancement published at Tech @ ShareChat.
[July 2021] Paper on reflection removal accepted to ICCV 2021.
[June 2021] Joined the ShareChat Data Science team.
[May 2021] Paper on audio-visual joint segmentation accepted to ICIP 2021.
[Dec 2018] Accepted Samsung Research offer; joining in June '19.
[Sep 2018] Received the Dean's Merit List Award for academic excellence at IIIT Hyderabad.
[Oct 2017] Our work on a multi-scale, low-latency face detection framework received the Best Paper Award at NGCT-2017.
Selected publications
My research lies at the intersection of computer vision and deep learning, with a focus on multi-modal learning (Vision + X), generative modeling, visual understanding, and their various applications. I am broadly interested in studying the interplay between different modalities with minimal supervision.
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
Sanjoy Chowdhury*, Sayan Nag*, Joseph KJ, Balaji Vasan Srinivasan, Dinesh Manocha
Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Paper (coming soon) /
Project Page /
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset MeLBench, and propose a new evaluation metric IMSM.
APoLLo: Unified Adapter and Prompt Learning for Vision Language Models
Sanjoy Chowdhury*, Sayan Nag*, Dinesh Manocha
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Paper /
Project Page /
Poster /
Video /
Code
Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities.
AdVerb: Visually Guided Audio Dereverberation
Sanjoy Chowdhury*, Sreyan Ghosh*, Subhrajyoti Dasgupta, Anton Ratnarajah, Utkarsh Tyagi, Dinesh Manocha
International Conference on Computer Vision (ICCV), 2023
Paper /
Project Page /
Video /
Poster /
Code
We present a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio.
Measured Albedo in the Wild: Filling the Gap in Intrinsics Evaluation
Jiaye Wu, Sanjoy Chowdhury, Hariharmano Shanmugaraja, David Jacobs, Soumyadip Sengupta
International Conference on Computational Photography (ICCP), 2023
Paper /
Project Page /
Dataset (coming soon)
To comprehensively evaluate albedo, we collect a new dataset, Measured Albedo in the Wild (MAW), and propose three new metrics that complement WHDR.
AudViSum: Self-Supervised Deep Reinforcement Learning for Diverse Audio-Visual Summary Generation
Sanjoy Chowdhury*, Aditya P. Patra*, Subhrajyoti Dasgupta, Ujjwal Bhattacharya
British Machine Vision Conference (BMVC), 2021
Paper /
Code /
Presentation
We introduce a novel deep reinforcement learning-based, self-supervised audio-visual summarization model that leverages both audio and visual information to generate diverse yet semantically meaningful summaries.
V-DESIRR: Very Fast Deep Embedded Single Image Reflection Removal
B H Pawan Prasad, Green Rosh K S, Lokesh R B, Kaushik Mitra, Sanjoy Chowdhury
International Conference on Computer Vision (ICCV), 2021
Paper /
Code
We propose a multi-scale, end-to-end architecture for detecting and removing weak, medium, and strong reflections from natural images.
Listen to the Pixels
Sanjoy Chowdhury, Subhrajyoti Dasgupta, Sudip Das, Ujjwal Bhattacharya
International Conference on Image Processing (ICIP), 2021
Paper /
Code /
Presentation
We exploit the concurrency between the audio and visual modalities to solve the joint audio-visual segmentation problem in a self-supervised manner.
Blogs
I have tried my hand at writing technical blogs.
The devil is in the details: Video Quality Enhancement Approaches
Link
The blog contextualizes the problem of video enhancement in present-day scenarios and discusses a couple of interesting approaches to this challenging task.
Academic services
I have served as a reviewer for the following conferences:
CVPR: 2023, '24
ICCV: 2023
ECCV: 2024
WACV: 2022, '23, '24
ACMMM: 2023, '24
Affiliations
IIT Kharagpur, Apr - Sep 2016
ISI Kolkata, Feb - July 2017
IIIT Hyderabad, Aug 2017 - May 2019
Mentor Graphics Hyderabad, May - July 2018
Samsung Research Bangalore, June 2019 - June 2021
ShareChat Bangalore, June 2021 - May 2022
UMD College Park, Aug 2022 - Present
Adobe Research, May 2023 - Aug 2023
KAUST, Jan 2024 - Present
Google Research, Feb 2024 - May 2024