We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a re-designed output gate. Action, object and context descriptors are fused by a multi-head prediction module that accounts for the inter-dependencies between noun-verb-action structured labels in egocentric video datasets. EgoACO features built-in visual explanations that aid both learning and interpretation. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA, show that by explicitly decoding action-context-object descriptors, EgoACO achieves state-of-the-art recognition performance.
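The pooling idea behind CAP, as summarized above, can be illustrated with a minimal numpy sketch: a dictionary of learnable weight vectors scores every spatial position of a frame's feature map, a softmax over positions turns the scores into spatial attention, and each dictionary entry pools an attention-weighted descriptor. Function names, shapes, and the choice of K = 3 descriptors here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cap_pool(features, dictionary):
    """Class-activation-style attention pooling (illustrative sketch).

    features:   (N, D) frame features flattened over N = H*W spatial positions
    dictionary: (K, D) learnable weights, one row per pooled descriptor
    returns:    (K, D) descriptors, each an attention-weighted average of the
                frame features over spatial positions
    """
    scores = dictionary @ features.T      # (K, N): relevance of each position
    attn = softmax(scores, axis=-1)       # attention over spatial positions
    return attn @ features                # (K, D): pooled descriptors

# Toy usage on random data (a 7x7 feature map with D = 256 channels,
# K = 3 descriptors standing in for action / context / object).
rng = np.random.default_rng(0)
feats = rng.standard_normal((49, 256))
dic = rng.standard_normal((3, 256))
desc = cap_pool(feats, dic)
print(desc.shape)  # (3, 256)
```

Because the attention weights sum to one over spatial positions, each descriptor lies in the convex hull of the frame features, which is what allows the attention maps to be read back as visual explanations.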