We present a new architecture for human action forecasting from videos. A temporal recurrent encoder captures the temporal information of the input videos, while a self-attention model attends to the relevant feature dimensions of the input space. To handle temporal variations in the observed video data, a feature masking technique is employed. An auxiliary classifier accurately classifies the observed actions, which helps the model understand what has happened so far. The decoder then generates future actions based on the outputs of the recurrent encoder and the self-attention model. Experimentally, we validate each component of our architecture and show the impact of the self-attention over feature dimensions, the temporal masking, and the auxiliary classifier on observed actions. We evaluate our method on two standard action forecasting benchmarks and obtain state-of-the-art results.
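The sketch below is a minimal, illustrative reading of the pipeline the abstract describes, not the authors' implementation. It assumes frame-level features of size `feat_dim` produced by a separate backbone; the module names (`ActionForecaster`, `feature_attention`, `observed_classifier`) and all layer choices and sizes are hypothetical stand-ins for the actual components.

```python
import torch
import torch.nn as nn


class ActionForecaster(nn.Module):
    """Hypothetical sketch: recurrent encoder + feature self-attention +
    temporal feature masking + auxiliary observed-action classifier + decoder."""

    def __init__(self, feat_dim=512, hidden_dim=256, num_classes=48,
                 future_steps=8, mask_prob=0.15):
        super().__init__()
        # Self-attention over feature dimensions: score each channel of the
        # input features and re-weight it (a simple gating interpretation).
        self.feature_attention = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.Softmax(dim=-1))
        # Temporal recurrent encoder over the observed frames.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Auxiliary classifier: which action has been observed so far.
        self.observed_classifier = nn.Linear(hidden_dim, num_classes)
        # Decoder: unrolls future steps from the encoder state.
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.future_classifier = nn.Linear(hidden_dim, num_classes)
        self.future_steps = future_steps
        self.mask_prob = mask_prob

    def forward(self, frames):                       # frames: (B, T, feat_dim)
        # Temporal feature masking: randomly drop whole time steps during
        # training to mimic temporal variations in the observed video.
        if self.training:
            keep = (torch.rand(frames.shape[:2], device=frames.device)
                    > self.mask_prob).unsqueeze(-1)
            frames = frames * keep
        # Attend to relevant feature dimensions, then encode temporally.
        attended = frames * self.feature_attention(frames)
        _, h = self.encoder(attended)                # h: (1, B, hidden_dim)
        observed_logits = self.observed_classifier(h[-1])
        # Decode future action logits conditioned on the encoder state.
        dec_in = h[-1].unsqueeze(1).repeat(1, self.future_steps, 1)
        dec_out, _ = self.decoder(dec_in, h)
        future_logits = self.future_classifier(dec_out)
        return observed_logits, future_logits        # (B, C), (B, F, C)


# Example usage on random features: a batch of 4 clips, 20 observed frames each.
model = ActionForecaster()
obs, fut = model(torch.randn(4, 20, 512))
```

In this reading, the observed-action logits would be trained with an auxiliary classification loss on what has happened so far, while the future logits carry the forecasting loss over the next `future_steps` steps; the paper's actual losses, masking scheme, and attention form may differ.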