Recent advances in egocentric video understanding models are promising, but their heavy computational expense is a barrier for many real-world applications. To address this challenge, we propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features by combining the semantics from a sparse set of video frames with the head motion from lightweight IMU readings. We further devise a novel self-supervised training strategy for IMU feature learning. Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models. We demonstrate its effectiveness on the Ego4D and EPIC-Kitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
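To make the distillation idea concrete, the following is a minimal sketch (not the authors' code) of a lightweight student that fuses sparse-frame features with an IMU motion feature and is trained to reconstruct the heavy video teacher's clip feature. The module names, feature dimensions, IMU input shape, and the simple concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EgoDistillStudent(nn.Module):
    """Hypothetical student: sparse frame features + IMU -> reconstructed clip feature."""
    def __init__(self, frame_dim=2048, imu_dim=128, out_dim=2304):
        super().__init__()
        # Encodes raw head-motion readings (e.g., 6-axis IMU over ~100 timesteps)
        # into a compact motion feature; a simple MLP stands in for the real encoder.
        self.imu_encoder = nn.Sequential(
            nn.Linear(6 * 100, imu_dim), nn.ReLU(), nn.Linear(imu_dim, imu_dim)
        )
        # Fuses the averaged sparse-frame features with the IMU feature.
        self.fusion = nn.Sequential(
            nn.Linear(frame_dim + imu_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, frame_feats, imu):
        # frame_feats: (B, N_frames, frame_dim) from a lightweight image backbone
        # imu: (B, 6, 100) raw accelerometer/gyroscope readings
        z_img = frame_feats.mean(dim=1)          # pool the sparse frame features
        z_imu = self.imu_encoder(imu.flatten(1)) # encode head motion
        return self.fusion(torch.cat([z_img, z_imu], dim=1))

def distill_loss(student_feat, teacher_feat):
    # Reconstruction objective: match the heavy video model's clip feature.
    return nn.functional.mse_loss(student_feat, teacher_feat)
```

At inference time only the student runs, which is where the large reduction in GFLOPs relative to the heavy clip-level video model would come from under this setup.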