Co-speech gesture is crucial for human-machine interaction and digital entertainment. While previous works mostly map speech audio to human skeletons (e.g., 2D keypoints), directly generating speakers' gestures in the image domain remains unsolved. In this work, we formally define and study this challenging problem of audio-driven co-speech gesture video generation, i.e., using a unified framework to generate speaker image sequences driven by speech audio. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns as well as fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from the implicit motion representation into codebooks. 2) Moreover, a co-speech gesture GPT with motion refinement (Co-Speech GPT) is devised to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos. Demo video and more resources can be found at: https://alvinliu0.github.io/projects/ANGIE
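The sketch below is a minimal, illustrative rendering (assuming PyTorch) of the two components named above: vector-quantizing an implicit motion representation into a discrete codebook of common gesture patterns, and a GPT-style autoregressive model that predicts the next motion code conditioned on audio features. All class names, dimensions, and hyperparameters here are assumptions for illustration, not the ANGIE implementation.

```python
import torch
import torch.nn as nn


class MotionQuantizer(nn.Module):
    """Maps continuous motion features to their nearest codebook entries (VQ step)."""

    def __init__(self, num_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, motion_feat: torch.Tensor):
        # motion_feat: (batch, time, code_dim) implicit motion representation
        B, T, D = motion_feat.shape
        flat = motion_feat.reshape(-1, D)                        # (B*T, code_dim)
        dist = torch.cdist(flat, self.codebook.weight)           # (B*T, num_codes)
        codes = dist.argmin(dim=-1).view(B, T)                   # discrete gesture-pattern indices
        quantized = self.codebook(codes)                         # (B, T, code_dim)
        # straight-through estimator so gradients flow back to the motion encoder
        quantized = motion_feat + (quantized - motion_feat).detach()
        return quantized, codes


class CoSpeechCodePredictor(nn.Module):
    """GPT-style decoder: predicts the next motion code from past codes and audio."""

    def __init__(self, num_codes: int = 512, dim: int = 64, audio_dim: int = 128):
        super().__init__()
        self.code_emb = nn.Embedding(num_codes, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, past_codes: torch.Tensor, audio_feat: torch.Tensor):
        # past_codes: (B, T) code indices; audio_feat: (B, T, audio_dim), frame-aligned
        x = self.code_emb(past_codes) + self.audio_proj(audio_feat)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=causal)                        # causal self-attention
        return self.head(h)                                      # logits over the next code


if __name__ == "__main__":
    quantizer, predictor = MotionQuantizer(), CoSpeechCodePredictor()
    motion = torch.randn(2, 16, 64)                              # dummy motion features
    audio = torch.randn(2, 16, 128)                              # dummy audio features
    _, codes = quantizer(motion)
    logits = predictor(codes, audio)
    print(logits.shape)                                          # torch.Size([2, 16, 512])
```

In this toy setup, the codebook plays the role of the reusable gesture patterns, while the audio-conditioned autoregressive predictor supplies the speech-dependent sequencing; the refinement of subtle rhythmic details described in the abstract is not modeled here.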