This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing results across different research papers, differences in results here arise solely from differences between methods, enabling direct comparison between systems. This year's dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field. The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. Additional material is available via the project website at https://youngwoo-yoon.github.io/GENEAchallenge2022/