In this report, we present the ReLER@ZJU submission to the Ego4D Moment Queries Challenge at ECCV 2022. The goal of this task is to retrieve and localize all instances of possible activities in egocentric videos. The Ego4D dataset is challenging for temporal action localization because the videos are long and each video contains multiple action instances with fine-grained action classes. To address these problems, we utilize a multi-scale transformer to classify action categories and predict the boundaries of each instance. Moreover, to better capture long-term temporal dependencies in long videos, we propose a segment-level recurrence mechanism. Compared with feeding all video features directly into the transformer encoder, the proposed segment-level recurrence mechanism alleviates optimization difficulties and achieves better performance. The final submission achieved a Recall@1, tIoU=0.5 score of 37.24 and an average mAP of 17.67, taking 3rd place on the leaderboard.
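The segment-level recurrence mechanism can be illustrated with a minimal sketch. The abstract does not give implementation details, so the following is an assumption-laden toy version in the style of Transformer-XL: video features are processed segment by segment, and each segment's attention keys and values are extended with the hidden states cached from the previous segment, so context propagates across segments without attending over the full video at once. The function names, segment length, and single-head attention are all illustrative, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    # Single-head scaled dot-product attention (toy stand-in for a
    # full transformer layer).
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

def encode_with_recurrence(features, seg_len):
    # features: (T, d) clip-level video features for one long video.
    memory = None  # hidden states cached from the previous segment
    outputs = []
    for start in range(0, len(features), seg_len):
        seg = features[start:start + seg_len]
        # Keys/values include the cached memory, extending the effective
        # temporal context beyond the current segment.
        kv = seg if memory is None else np.concatenate([memory, seg], axis=0)
        hidden = attend(seg, kv)
        # Cache this segment's hidden states for the next segment
        # (in training these would be gradient-stopped).
        memory = hidden
        outputs.append(hidden)
    return np.concatenate(outputs, axis=0)

feats = np.random.randn(928, 256)  # hypothetical: 928 clip features, dim 256
out = encode_with_recurrence(feats, seg_len=232)
print(out.shape)
```

Because each attention call only spans one segment plus its memory, the per-step cost stays bounded even for very long videos, which is the optimization benefit the abstract contrasts against feeding all features to the encoder at once.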