Audio2Gestures: Generating Diverse Gestures from Audio

When speaking the same sentence, people may perform different gestures depending on various mental and physical factors. This inherent one-to-many relationship makes co-speech gesture generation from audio particularly challenging. Conventional CNNs/RNNs assume a one-to-one mapping and thus tend to predict the average of all possible target motions, which easily results in plain, boring motions at inference time. We therefore propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into a shared code and a motion-specific code. The shared code is expected to account for the motion component that is more correlated with the audio, while the motion-specific code is expected to capture diverse motion information that is more independent of the audio. However, splitting the latent code into two parts poses extra training difficulties. Several crucial training losses and strategies, including a relaxed motion loss, a bicycle constraint, and a diversity loss, are designed to better train the VAE. Experiments on both 3D and 2D motion datasets verify, quantitatively and qualitatively, that our method generates more realistic and diverse motions than previous state-of-the-art methods. In addition, our formulation is compatible with discrete cosine transform (DCT) modeling and other popular backbones (i.e., RNN, Transformer). As for motion losses and quantitative motion evaluation, we find that structured losses/metrics (e.g., STFT) that consider temporal and/or spatial context complement the most commonly used point-wise losses (e.g., PCK), resulting in better motion dynamics and more nuanced motion details. Finally, we demonstrate that our method can be readily used to generate motion sequences with user-specified motion clips on the timeline.
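
The split-latent idea described above can be illustrated with a toy sketch. It is not the authors' implementation: the encoder, decoder, code sizes, and the pairwise-distance diversity measure below are all hypothetical stand-ins, chosen only to show how one audio input with a fixed shared code can yield several different motions when the motion-specific code is resampled.

```python
import random

SHARED_DIM = 4    # hypothetical size of the audio-correlated (shared) code
SPECIFIC_DIM = 4  # hypothetical size of the motion-specific code


def encode_audio(audio_frame):
    """Toy stand-in for an audio encoder: deterministic map to a shared code."""
    mean = sum(audio_frame) / len(audio_frame)
    return [mean * (i + 1) for i in range(SHARED_DIM)]


def sample_motion_specific(rng):
    """Draw the motion-specific code from a standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(SPECIFIC_DIM)]


def decode(shared, specific):
    """Toy stand-in for a motion decoder over the concatenated latent code."""
    latent = shared + specific
    # Two hypothetical pose values, each mixing entries from both code halves.
    return [sum(latent[0::2]), sum(latent[1::2])]


def generate(audio_frame, n_samples, seed=0):
    """One audio input, many motions: only the motion-specific code varies."""
    rng = random.Random(seed)
    shared = encode_audio(audio_frame)
    return [decode(shared, sample_motion_specific(rng)) for _ in range(n_samples)]


def diversity(motions):
    """Mean pairwise L1 distance between motions (an illustrative measure)."""
    pairs = [(a, b) for i, a in enumerate(motions) for b in motions[i + 1:]]
    if not pairs:
        return 0.0
    total = sum(sum(abs(x - y) for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)


motions = generate([0.1, 0.2, 0.3], n_samples=3)
```

In this sketch the shared code is identical across the three samples, so any spread measured by `diversity(motions)` comes entirely from the resampled motion-specific code. A one-to-one model corresponds to dropping that second code, which collapses all samples to a single output.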
