Achieving realistic, vivid, and human-like synthesized conversational gestures conditioned on multi-modal data remains an unsolved problem, due to the lack of available datasets, models, and standard evaluation metrics. To address this, we build the Body-Expression-Audio-Text dataset, BEAT, which has i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Qualitative and quantitative experiments demonstrate the validity of the metrics, the quality of the ground truth data, and the state-of-the-art performance of the baseline. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating human gestures, and it may contribute to a number of different research fields, including controllable gesture synthesis, cross-modality analysis, and emotional gesture recognition. The data, code, and model will be released for research.