Action anticipation in egocentric videos is a challenging task due to the inherently multi-modal nature of human actions. Moreover, some actions unfold faster or slower than others depending on the actor and the surrounding context, which can change from one occurrence to the next and lead to different predictions. Based on this observation, we build upon the RULSTM architecture, which is specifically designed for anticipating human actions, and propose a novel attention-based technique that simultaneously evaluates slow and fast features extracted from three modalities, namely RGB, optical flow, and object features. Two branches process information at different time scales, i.e., frame rates, and several fusion schemes are considered to improve prediction accuracy. We perform extensive experiments on the EpicKitchens-55 and EGTEA Gaze+ datasets and demonstrate that our technique systematically improves over the RULSTM architecture in terms of Top-5 accuracy at different anticipation times.
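To make the two-branch idea concrete, below is a minimal sketch (not the authors' code) of processing one modality at two frame rates and fusing the branches with attention. The class name, hidden sizes, stride, and the use of plain LSTMs in place of RULSTM's rolling/unrolling LSTMs are illustrative assumptions; 2513 is the number of action classes in EpicKitchens-55.

```python
# Hypothetical sketch of a slow/fast two-branch anticipation model with
# attention-based fusion; it simplifies RULSTM to plain LSTMs for brevity.
import torch
import torch.nn as nn

class SlowFastAnticipation(nn.Module):
    def __init__(self, feat_dim=1024, hidden=512, num_classes=2513):
        super().__init__()
        # "Fast" branch sees the full frame rate, "slow" branch a subsampled stream.
        self.fast_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.slow_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Attention scores over the two branches, computed from their summaries.
        self.attn = nn.Linear(2 * hidden, 2)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats, slow_stride=3):
        # feats: (B, T, feat_dim) pre-extracted features for one modality
        # (e.g., RGB, optical flow, or object features).
        fast_out, _ = self.fast_lstm(feats)                     # full frame rate
        slow_out, _ = self.slow_lstm(feats[:, ::slow_stride])   # lower frame rate
        h_fast, h_slow = fast_out[:, -1], slow_out[:, -1]       # last hidden states
        # Softmax attention decides, per sample, how much to trust each time scale.
        w = torch.softmax(self.attn(torch.cat([h_fast, h_slow], dim=-1)), dim=-1)
        fused = w[:, 0:1] * h_fast + w[:, 1:2] * h_slow
        return self.classifier(fused)

# Toy usage: a batch of 4 clips, 14 observed frames, 1024-d features per frame.
if __name__ == "__main__":
    model = SlowFastAnticipation()
    scores = model(torch.randn(4, 14, 1024))
    print(scores.shape)  # torch.Size([4, 2513])
```

In a multi-modal setting, one such module per modality (RGB, optical flow, objects) would be instantiated, and their outputs combined with one of the fusion schemes discussed in the paper.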