Although human action anticipation is an inherently multi-modal task, state-of-the-art methods on well-known action anticipation datasets exploit this data only through ensemble methods that average the scores of unimodal anticipation networks. In this work we introduce transformer-based modality fusion techniques which unify multi-modal data at an early stage. Our Anticipative Feature Fusion Transformer (AFFT) proves superior to popular score fusion approaches and achieves state-of-the-art results, outperforming previous methods on EpicKitchens-100 and EGTEA Gaze+. Our model is easily extensible and allows new modalities to be added without architectural changes. Consequently, we extract audio features on EpicKitchens-100, which we add to the set of features commonly used in the community.
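To make the contrast between score fusion and early feature fusion concrete, the following is a minimal sketch, not the actual AFFT architecture: per-modality feature vectors (the dimensions, modality names, and random untrained weights are all illustrative assumptions) are projected into a shared token space, a single self-attention step lets the modality tokens exchange information, and the result is pooled into one fused representation for a downstream anticipation head.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding dimension (assumed for illustration)

# Per-modality features of different sizes, e.g. RGB, optical flow, audio
feats = {"rgb": rng.standard_normal(1024),
         "flow": rng.standard_normal(512),
         "audio": rng.standard_normal(128)}

# 1) Project each modality into a shared token space (early-fusion input).
#    Projections are random stand-ins for learned linear layers.
proj = {m: rng.standard_normal((v.shape[0], D)) / np.sqrt(v.shape[0])
        for m, v in feats.items()}
tokens = np.stack([feats[m] @ proj[m] for m in feats])  # (num_modalities, D)

# 2) Single-head self-attention lets the modality tokens attend to each other
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(D)
scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)
fused_tokens = attn @ V  # (num_modalities, D)

# 3) Pool into a single fused representation for the anticipation head
fused = fused_tokens.mean(axis=0)  # shape (D,)
print(fused.shape)
```

Because fusion happens on feature tokens rather than on per-modality class scores, adding a new modality only means adding one more projected token; the attention block itself needs no architectural change.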