In this paper, we propose a new approach to understanding actions in egocentric videos that exploits the semantics of object interactions at both the frame and temporal levels. At the frame level, we use a region-based approach that takes as input a primary region roughly corresponding to the user's hands and a set of secondary regions potentially corresponding to the interacting objects, and computes the action score through a CNN formulation. This information is then fed to a Hierarchical Long Short-Term Memory Network (HLSTM) that captures temporal dependencies between actions within and across shots. Ablation studies thoroughly validate the proposed approach, showing in particular that both levels of the HLSTM architecture contribute to the performance improvement. Furthermore, quantitative comparisons show that the proposed approach outperforms the state of the art in action recognition on standard benchmarks, without relying on motion information.
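To make the two-level temporal model concrete, the following is a minimal, hypothetical sketch of a hierarchical LSTM of the kind described above, assuming PyTorch and made-up feature, hidden, and class dimensions; it is an illustration of the idea (a within-shot LSTM whose final state feeds an across-shot LSTM), not the authors' implementation.

```python
# Illustrative sketch only: dimensions, layer choices, and the pooling of the
# lower level into the upper level are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class HLSTM(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=256, num_actions=61):
        super().__init__()
        # Lower level: temporal dependencies between frames within a shot.
        self.frame_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Upper level: dependencies between consecutive shots.
        self.shot_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_actions)

    def forward(self, shots):
        # shots: list of tensors, each (num_frames, feat_dim) holding the
        # frame-level CNN action scores/features of one shot.
        shot_reprs = []
        for frames in shots:
            # Run the within-shot LSTM and keep its last hidden state
            # as the shot representation.
            _, (h, _) = self.frame_lstm(frames.unsqueeze(0))
            shot_reprs.append(h[-1])                   # (1, hidden_dim)
        shot_seq = torch.stack(shot_reprs, dim=1)      # (1, num_shots, hidden_dim)
        out, _ = self.shot_lstm(shot_seq)
        return self.classifier(out.squeeze(0))         # per-shot action scores

# Usage example with random features for three shots of varying length.
model = HLSTM()
shots = [torch.randn(t, 1024) for t in (12, 8, 20)]
print(model(shots).shape)  # torch.Size([3, 61])
```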