The automatic generation of stylized co-speech gestures has recently received increasing attention. Previous systems typically allow style control via predefined text labels or example motion clips, which are often not flexible enough to convey user intent accurately. In this work, we present GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. We leverage the power of the large-scale Contrastive Language-Image Pre-training (CLIP) model and present a novel CLIP-guided mechanism that extracts efficient style representations from multiple input modalities, such as a piece of text, an example motion clip, or a video. Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator via an adaptive instance normalization (AdaIN) layer. We further devise a gesture-transcript alignment mechanism based on contrastive learning that ensures semantically correct gesture generation. Our system can also be extended to allow fine-grained style control of individual body parts. We demonstrate an extensive set of examples showing the flexibility and generalizability of our model to a variety of style descriptions. In a user study, we show that our system outperforms the state-of-the-art approaches with respect to human likeness, appropriateness, and style correctness.
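To make the style-injection step concrete, the following is a minimal sketch (in PyTorch, not the authors' released code) of how a CLIP-derived style embedding could modulate gesture features through an AdaIN layer; the module names, feature dimensions, and example prompt are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: injecting a CLIP-style embedding into a generator layer
# via adaptive instance normalization (AdaIN). Shapes and names are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class AdaINStyleInjection(nn.Module):
    """Modulates per-channel gesture features with a style embedding."""

    def __init__(self, feature_dim: int, style_dim: int = 512):
        super().__init__()
        # Map the style embedding (e.g., a 512-d CLIP vector) to a
        # per-channel scale and bias.
        self.affine = nn.Linear(style_dim, 2 * feature_dim)
        self.norm = nn.InstanceNorm1d(feature_dim, affine=False)

    def forward(self, features: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # features: (batch, feature_dim, time); style: (batch, style_dim)
        scale, bias = self.affine(style).chunk(2, dim=-1)
        normalized = self.norm(features)
        return (1 + scale.unsqueeze(-1)) * normalized + bias.unsqueeze(-1)


# Usage: the style vector could come from CLIP's text or image encoder.
layer = AdaINStyleInjection(feature_dim=256, style_dim=512)
gesture_features = torch.randn(4, 256, 64)  # hypothetical latent gesture features
clip_style = torch.randn(4, 512)            # placeholder for a CLIP embedding
out = layer(gesture_features, clip_style)
print(out.shape)  # torch.Size([4, 256, 64])
```

Because AdaIN only rescales and shifts normalized feature channels, the same generator can, in principle, be conditioned on style embeddings from any modality CLIP can encode (text, images, or video frames).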