The automatic generation of stylized co-speech gestures has recently received increasing attention. Previous systems typically allow style control via predefined text labels or example motion clips, which are often not flexible enough to convey user intent accurately. In this work, we present GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. We leverage the power of the large-scale Contrastive Language-Image Pre-training (CLIP) model and present a novel CLIP-guided mechanism that extracts efficient style representations from multiple input modalities, such as a piece of text, an example motion clip, or a video. Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representation of style into the generator via an adaptive instance normalization (AdaIN) layer. We further devise a gesture-transcript alignment mechanism, based on contrastive learning, that ensures semantically correct gesture generation. Our system can also be extended to allow fine-grained style control of individual body parts. We demonstrate an extensive set of examples showing the flexibility and generalizability of our model to a variety of style descriptions. In a user study, we show that our system outperforms state-of-the-art approaches in terms of human likeness, appropriateness, and style correctness.
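To make the AdaIN-based style injection described above more concrete, the following is a minimal sketch (not the authors' implementation) of how a CLIP style embedding could modulate gesture features inside a diffusion denoiser. All module names, tensor shapes, and dimensions (e.g. a 512-dimensional CLIP embedding, a 256-channel feature sequence) are illustrative assumptions.

```python
# Minimal sketch of AdaIN-style conditioning on a CLIP embedding.
# Hypothetical shapes/dimensions; not the GestureDiffuCLIP codebase.
import torch
import torch.nn as nn

class AdaINStyleLayer(nn.Module):
    """Modulates gesture features with a scale/shift predicted from a CLIP style code."""
    def __init__(self, feature_dim: int, clip_dim: int = 512):
        super().__init__()
        # Instance normalization removes the feature statistics of the content...
        self.norm = nn.InstanceNorm1d(feature_dim, affine=False)
        # ...and the style code supplies new per-channel scale (gamma) and shift (beta).
        self.to_scale_shift = nn.Linear(clip_dim, 2 * feature_dim)

    def forward(self, features: torch.Tensor, clip_style: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim, time); clip_style: (batch, clip_dim)
        gamma, beta = self.to_scale_shift(clip_style).chunk(2, dim=-1)
        normalized = self.norm(features)
        return normalized * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)

# Usage: a CLIP text/image encoder would produce `clip_style` from a style prompt,
# an example motion clip, or a video; the denoiser applies this layer in each block
# so the generated gesture latents follow the described style.
features = torch.randn(2, 256, 90)   # hypothetical gesture feature sequence (batch, channels, frames)
clip_style = torch.randn(2, 512)     # hypothetical CLIP embedding of a style description
styled = AdaINStyleLayer(256)(features, clip_style)
print(styled.shape)                  # torch.Size([2, 256, 90])
```

The design choice mirrors standard AdaIN-based style transfer: content statistics are normalized away, and the conditioning signal re-imposes statistics derived from the style input, which is what allows a single generator to be steered by heterogeneous style modalities that share the CLIP embedding space.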