Evaluating gesture-generation in a large-scale open challenge: The GENEA Challenge 2022

From arXiv. The first three authors made equal contributions and share joint first authorship. arXiv admin note: substantial text overlap with arXiv:2208.10441

This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around -0.5. Based on the challenge results we formulate numerous recommendations for system building and evaluation.
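To make the Kendall's tau figure more concrete, the following is a minimal sketch of how a rank correlation between an objective metric and subjective ratings could be computed. The function name `kendall_tau` and all numbers are hypothetical and chosen only for illustration; they are not data or code from the challenge. The sketch implements the simple tau-a variant and assumes no tied values. A negative tau is the expected direction here, since a lower FGD means generated motion is statistically closer to natural motion.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Illustrative helper, not from the paper. Tied pairs count as
    neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1  # both sequences order this pair the same way
        elif direction < 0:
            discordant += 1  # the sequences disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up values for five hypothetical systems (NOT challenge results).
fgd = [10.0, 20.0, 30.0, 40.0, 50.0]             # lower is better
human_likeness = [60.0, 65.0, 50.0, 45.0, 40.0]  # higher is better

tau = kendall_tau(fgd, human_likeness)
print(f"Kendall's tau = {tau:.2f}")
```

Because tau depends only on the ordering of systems, it is insensitive to the scale of either quantity, which makes it a natural choice when comparing a distance metric against rating scores.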
