We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences. In our generator, we encode the Mel-frequency cepstral coefficients and the text transcript computed from the input speech in separate encoders to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features. We use our affective encoder in both our generator, where it learns affective features from the seed poses to guide the gesture synthesis, and our discriminator, where it enforces that the synthesized gestures contain the appropriate affective expressions. We perform extensive evaluations on two benchmark datasets for gesture synthesis from speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10--33%, the mean acceleration difference by 8--58%, and the Fr\'echet Gesture Distance by 21--34%. We also conduct a user study and observe that, compared to the best current baselines, around 15.28% of participants indicated that our synthesized gestures appear more plausible, and around 16.32% of participants felt that the gestures had more appropriate affective expressions aligned with the speech.
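To make the affective encoder concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of a pose-based encoder built from spatial-temporal graph convolutions. The joint count, chain-skeleton adjacency, layer widths, temporal kernel size, and the use of a single graph scale (the paper describes multi-scale convolutions) are all illustrative assumptions.

```python
# Minimal sketch of a pose-based affective encoder using spatial-temporal
# graph convolutions. All sizes and the skeleton graph are assumptions for
# illustration, not values taken from the paper.

import torch
import torch.nn as nn


class SpatialTemporalGraphConv(nn.Module):
    """One ST-GCN block: a graph convolution over joints followed by a
    temporal convolution over frames."""

    def __init__(self, in_channels, out_channels, adjacency, temporal_kernel=9):
        super().__init__()
        # Fixed, row-normalized joint adjacency (an assumption; learned or
        # multi-scale adjacencies are also common in ST-GCN variants).
        self.register_buffer("A", adjacency / adjacency.sum(dim=1, keepdim=True))
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # mix features along the skeleton
        x = self.relu(self.spatial(x))
        return self.relu(self.temporal(x))


class AffectiveEncoder(nn.Module):
    """Stacks ST-GCN blocks and pools to a latent affective feature vector."""

    def __init__(self, num_joints=10, latent_dim=128):
        super().__init__()
        # Simple chain skeleton over the upper-body joints (an assumption).
        A = (torch.eye(num_joints)
             + torch.diag(torch.ones(num_joints - 1), 1)
             + torch.diag(torch.ones(num_joints - 1), -1))
        self.blocks = nn.Sequential(
            SpatialTemporalGraphConv(3, 32, A),   # input: 3D joint coordinates
            SpatialTemporalGraphConv(32, 64, A),
            SpatialTemporalGraphConv(64, latent_dim, A),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # pool over frames and joints

    def forward(self, poses):
        # poses: (batch, 3, frames, joints)
        return self.pool(self.blocks(poses)).flatten(1)  # (batch, latent_dim)
```

For example, `AffectiveEncoder()(torch.randn(4, 3, 34, 10))` encodes a batch of four 34-frame seed-pose sequences over 10 joints into a (4, 128) latent affective feature; in the described architecture, such features condition the generator and supply the discriminator with an affect-aware view of the poses.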