EGOADAPT: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception

ICCV 2025


1University of Maryland, 2Meta, 3Worcester Polytechnic Institute, 4University of Toronto

TL;DR

Modern perception models, particularly those designed for multisensory egocentric tasks, have achieved remarkable performance but often come with substantial computational costs. These high demands pose challenges for real-world deployment, especially in resource-constrained environments. In this paper, we introduce EGOADAPT, a framework that adaptively performs cross-modal distillation and policy learning to enable efficient inference across different egocentric perception tasks, including egocentric action recognition, active speaker localization, and behavior anticipation. Our proposed policy module is adaptable to task-specific action spaces, making it broadly applicable. Experimental results on three challenging egocentric datasets—EPIC-Kitchens, EasyCom, and Aria Everyday Activities—demonstrate that our method significantly enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%, and energy consumption by up to 9.6×, while remaining on par with, and in many cases outperforming, corresponding state-of-the-art models.


🔥 Highlights

Key contributions of EGOADAPT:
  1. We propose a unified framework that harnesses the strengths of both cross-modal distillation and policy learning to achieve the optimal balance between performance and efficiency.

  2. We efficiently train a multi-modal policy network jointly with the distillation model using standard backpropagation through Gumbel-Softmax sampling, adapting to various egocentric tasks by expanding its action space (see the sketch after this list).

  3. We validate our approach on three challenging datasets and egocentric perception tasks, achieving up to 89.09% reduction in GMACs, 82.02% reduction in parameters, and up to 9.6× reduction in energy, while remaining on par with, and in many cases outperforming, corresponding state-of-the-art models.
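To make the second point concrete, below is a minimal, hypothetical sketch of how a lightweight policy head can be trained with straight-through Gumbel-Softmax sampling so that discrete selection decisions remain differentiable and the policy can be optimized jointly with the rest of the model via standard backpropagation. The class name `PolicyNet`, the feature dimension, and the action space size are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Hypothetical lightweight policy head: maps a cheap feature summary to
    per-decision logits (e.g., which modality, video frame, or audio channel
    to keep for the downstream task)."""
    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_actions)

    def forward(self, feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(feats)
        # Straight-through Gumbel-Softmax: discrete (one-hot) decisions in the
        # forward pass, differentiable soft samples in the backward pass, so the
        # policy trains jointly with the distillation model via standard backprop.
        return F.gumbel_softmax(logits, tau=tau, hard=True)

# Toy usage: pick one of 4 candidate actions for a batch of 8 clips.
policy = PolicyNet(feat_dim=256, num_actions=4)
decisions = policy(torch.randn(8, 256))   # (8, 4) one-hot selections
```

Expanding the action space for a new task amounts to enlarging `num_actions` (or adding parallel heads), which is why the same policy module can be reused across the different egocentric tasks.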

EGOADAPT Framework

Illustration of EGOADAPT. Our framework consists of two main components: a lightweight policy module Π and a distillation module Φ composed of different sub-networks that are trained jointly (via late fusion with learnable weights) for various egocentric tasks, including action recognition, active speaker localization, and behavior anticipation. The policy module is adaptable, dynamically selecting the optimal modality, video frame, and audio channel based on the downstream task to balance performance and efficiency. The entire pipeline is trained jointly over multiple stages (right side); in Stage 1, the policy module is disabled. FPS: frames per second.
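As a rough illustration of the "late fusion with learnable weights" mentioned above, the sketch below combines per-branch (e.g., audio and visual) logits with softmax-normalized learnable weights. The class name `LearnableLateFusion` and the two-branch setup are assumptions for illustration only, not the exact fusion used in the paper.

```python
import torch
import torch.nn as nn

class LearnableLateFusion(nn.Module):
    """Hypothetical late-fusion layer: combines per-branch logits
    with learnable, softmax-normalized weights."""
    def __init__(self, num_branches: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_branches))

    def forward(self, branch_logits: list) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)           # normalized fusion weights
        stacked = torch.stack(branch_logits, dim=0)      # (num_branches, batch, classes)
        return (w.view(-1, 1, 1) * stacked).sum(dim=0)   # weighted sum over branches

# Toy usage: fuse audio and video logits for a 10-class task.
fusion = LearnableLateFusion(num_branches=2)
fused = fusion([torch.randn(8, 10), torch.randn(8, 10)])  # (8, 10)
```

Because the fusion weights are ordinary parameters, they are learned end-to-end together with the sub-networks rather than being hand-tuned per task.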

Quantitative Results

Qualitative Results

BibTeX

@article{chowdhury2025egoadapt,
  title={EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception},
  author={Chowdhury, Sanjoy and Biswas, Subrata and Nag, Sayan and Nagarajan, Tushar and Murdock, Calvin and Ananthabhotla, Ishwarya and Qian, Yijun and Ithapu, Vamsi Krishna and Manocha, Dinesh and Gao, Ruohan},
  journal={arXiv preprint arXiv:2506.21080},
  year={2025}
}