Sanjoy's Research Garden

Cross-modal Generation (2024-now)

Cross-modal generation seeks to integrate and synthesize diverse data modalities into cohesive outputs. MAGNET introduces a method for audio-visual retrieval-augmented generation (RAG), improving the fluidity with which machines process multimodal information. MeLFusion (CVPR 2024) demonstrates techniques for effective multimedia synthesis, opening new avenues for creative and applied uses. Meanwhile, Adverb (ICCV 2023) focuses on visually guided audio dereverberation, blending visual cues with audio refinement to produce clear, precise soundscapes. Together, these works pave the way for more seamless and intuitive interaction among sensory modalities.

Audio-Visual Representation Learning (2021-now)

Audio-visual representation learning addresses the need for machines to understand and summarize sensory data efficiently. EgoAdapt (ICCV 2025) develops adaptation methods for egocentric perception, enabling systems to extract and use contextual insight from first-person views. AudViSum (BMVC 2021) contributes summarization techniques that distill critical information from long streams of audio-visual data, improving efficiency in practical applications. Listen to Pixels (ICIP 2021) explores how audio cues can be represented visually, translating sound into forms that are easier for models to interpret. Collectively, these studies advance machine perception toward more human-like interaction with the world.

Audio-Visual LLMs (2024-now)

In audio-visual LLMs, the integration of auditory and visual data enhances contextual comprehension and information processing. AURELIA (ICCV 2025) lays the foundation for audio-visual reasoning, facilitating nuanced interactions within complex environments. AVTrustBench (ICCV 2025) assesses the reliability of model interpretations across varied audio-visual contexts, helping ensure consistent and accurate performance. Meerkat (ECCV 2024) introduces approaches that enable fine-grained understanding in audio-visual LLMs, bridging the gap between sensory inputs and language. These efforts collectively aim to elevate AI's ability to navigate and interpret multifaceted scenarios, akin to human cognitive processes.

Integrating Vision-Language (2022-now)

The integration of vision and language is pivotal for enhancing AI's task performance in dynamic environments. ASPIRE (ACL Findings 2024) illustrates this by refining navigation and interaction tasks through improved sensory synchronization. VLMNav (NAACL 2024) focuses on augmenting navigation capabilities by combining visual and verbal cues, enabling more intuitive AI guidance systems. Intent-o-Meter (Nature Scientific Reports 2023) introduces new approaches for understanding intent, offering deeper insight into human-like AI interactions. Complementing these efforts, Apollo (EMNLP 2023) strengthens the synergy between words and images, contributing to AI's communication proficiency. Together, these works chart a path towards more effective and context-aware AI systems.

Computational Photography (2021-2022)

Computational photography sits at the intersection of technology and art, striving to enhance image and video quality through advanced processing techniques. MAW (ICCP 2023) introduces methods that refine visual clarity and detail, vital for applications that demand high aesthetic standards. VDESIRR (ICCV 2021) complements this with fast single-image reflection removal, employing an efficient processing pipeline to improve the visual quality of captured photos. Together, these works contribute to fields where superior image quality is paramount, enriching the digital visual landscape.