We present Ego-Only, the first training pipeline that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) pretraining. Previous approaches found that egocentric models cannot be trained effectively from scratch and that exocentric representations transfer well to first-person videos. In this paper we revisit these two observations. Motivated by the large content and appearance gap separating the two domains, we propose a strategy that enables effective training of egocentric models without exocentric pretraining. Our Ego-Only pipeline is simple. It trains the video representation with a masked autoencoder and finetunes it for temporal segmentation. The learned features are then fed to an off-the-shelf temporal action localization method to detect actions. We evaluate our approach on two established egocentric video datasets: Ego4D and EPIC-Kitchens-100. On Ego4D, our Ego-Only is on par with exocentric pretraining methods that use an order of magnitude more labels. On EPIC-Kitchens-100, our Ego-Only even outperforms exocentric pretraining (by 2.1% on verbs and by 1.8% on nouns), setting a new state-of-the-art.