We address the challenging task of human reaction generation, which aims to generate a corresponding reaction based on an input action. Most existing works do not focus on generating and predicting the reaction, and they cannot generate the motion when only the action is given as input. To address this limitation, we propose a novel interaction Transformer (InterFormer), a Transformer network with both temporal and spatial attention. Specifically, temporal attention captures the temporal dependencies of the motion of both characters and of their interaction, while spatial attention learns the dependencies between the different body parts of each character and those involved in the interaction. Moreover, we propose using graphs to improve the performance of spatial attention via an interaction distance module that helps the attention focus on nearby joints from both characters. Extensive experiments on the SBU interaction, K3HI, and DuetDance datasets demonstrate the effectiveness of InterFormer. Our method is general and can be used to generate more complex and long-term interactions. We also provide videos of generated reactions, along with the code and pre-trained models, at https://github.com/CRISTAL-3DSAM/InterFormer
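To make the idea of distance-modulated spatial attention concrete, below is a minimal, self-contained sketch (not the authors' implementation; the function names, the Gaussian distance weighting, and the single-head formulation are illustrative assumptions). It shows one way nearby joints of the two characters can be emphasized by biasing attention scores with an inter-character joint distance term.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interaction_distance_bias(joints_a, joints_b, sigma=1.0):
    """Hypothetical stand-in for an interaction distance module:
    weights close to 1 for spatially close joints of the two characters."""
    # joints_a: (J, 3) action character, joints_b: (J, 3) reacting character
    diff = joints_a[:, None, :] - joints_b[None, :, :]   # (J, J, 3) pairwise offsets
    dist = np.linalg.norm(diff, axis=-1)                 # (J, J) Euclidean distances
    return np.exp(-dist ** 2 / (2 * sigma ** 2))         # Gaussian falloff with distance

def spatial_attention(queries, keys, values, distance_bias):
    """Single-head attention over joints, modulated by the distance bias."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                       # (J, J) similarity scores
    weights = softmax(scores + np.log(distance_bias + 1e-8))     # bias added in log-space
    return weights @ values

# Toy usage: 15 joints per character, 3-D positions used directly as features.
rng = np.random.default_rng(0)
joints_a = rng.normal(size=(15, 3))   # acting character
joints_b = rng.normal(size=(15, 3))   # reacting character
bias = interaction_distance_bias(joints_a, joints_b)
out = spatial_attention(joints_b, joints_a, joints_a, bias)  # reaction attends to action
print(out.shape)  # (15, 3)
```

In a full model these attention weights would operate on learned joint embeddings rather than raw coordinates, and temporal attention would be applied along the frame axis; this sketch only illustrates how a distance-based bias can steer spatial attention toward interacting joints.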