Some group activities, such as team sports and choreographed dances, involve closely coupled interaction between participants. Here we investigate the tasks of inferring and predicting participant behavior, in terms of motion paths and actions, under such conditions. We narrow the problem to that of estimating how a set of target participants react to the behavior of other observed participants. Our key idea is to model the spatio-temporal relations among participants in a manner that is robust to error accumulation during frame-wise inference and prediction. We propose a novel Entry-Flipped Transformer (EF-Transformer), which models the relations among participants with attention mechanisms in both the spatial and temporal domains. Unlike typical transformers, we tackle error accumulation by flipping the order of the query, key, and value entries, which increases the importance and fidelity of the observed features in the current frame. Comparative experiments show that our EF-Transformer achieves the best performance on a newly collected tennis doubles dataset, a Ceilidh dance dataset, and two pedestrian datasets. Furthermore, we demonstrate that our EF-Transformer is better at limiting accumulated errors and recovering from wrong estimations.
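To make the idea of "flipping" the attention entries concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single-head attention step with no learned projections, a per-frame temporal setting, and hypothetical function names (`standard_step`, `entry_flipped_step`). In a conventional decoder-style step, the query (and hence the residual path) comes from the latest *estimated* target features, so estimation errors propagate forward. In the flipped variant sketched here, the query comes from the *observed* features of the current frame, and key/value come from the past target estimates.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max along the axis for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    # Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value


def standard_step(target_hist, obs_hist):
    # Conventional arrangement (assumed for contrast).
    # target_hist: (T, d) past *estimated* target features, last row = latest.
    # obs_hist:    (T, d) encoded features of the observed participants.
    q = target_hist[-1:]  # query from the latest estimate
    # The residual adds the query, so errors in the estimate pass straight through.
    return q + attention(q, obs_hist, obs_hist)


def entry_flipped_step(target_hist, obs_hist):
    # Hypothetical entry-flipped arrangement: query from the current-frame
    # observation, key/value from past target estimates. The residual path
    # is now anchored on observed features rather than on estimates.
    q = obs_hist[-1:]
    return q + attention(q, target_hist, target_hist)


rng = np.random.default_rng(0)
T, d = 5, 4
target_hist = rng.normal(size=(T, d))
obs_hist = rng.normal(size=(T, d))
print(standard_step(target_hist, obs_hist).shape)       # (1, 4)
print(entry_flipped_step(target_hist, obs_hist).shape)  # (1, 4)
```

The design point the sketch isolates is that, in a residual attention block, whichever stream supplies the query dominates the block's output. Assigning that role to current-frame observations, rather than to the model's own previous outputs, is one way to read the abstract's claim about increasing the fidelity of observed features. The actual EF-Transformer's token layout, projections, and spatial attention are not specified here.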