We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled using only a short example motion clip, even for motion styles unseen during training. Our model uses a variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or through blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. In a series of experiments, we first demonstrate the flexibility and generalizability of our model to new speakers and styles. In a user study, we then show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. Finally, we release a high-quality dataset of full-body gesture motion, including fingers, with accompanying speech, spanning 19 different styles.