Text-driven content creation has emerged as a transformative technique that is reshaping creative workflows. Here we study the task of text-driven human video generation, where a video sequence is synthesized from text describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation must preserve the appearance of the synthesized human while the performer executes complex motions. In this work, we present Text2Performer, which generates vivid human videos with articulated motions from text. Text2Performer has two novel designs: 1) a decomposed human representation and 2) a diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representations in an unsupervised manner by exploiting the nature of human videos; in this way, appearance is well preserved across the generated frames. Then, we propose a continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in a discrete space, the continuous VQ-diffuser directly outputs continuous pose embeddings for better motion modeling. Finally, a motion-aware masking strategy masks the pose embeddings spatio-temporally to enhance temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute the Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512x256 resolution) with diverse appearances and flexible motions.
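The motion-aware masking idea can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the paper's exact design: we assume pose embeddings form a (T, D) array of T frames, that per-frame motion scores are available, and that masked positions are replaced with a zero "mask token"; the function name `motion_aware_mask` is hypothetical.

```python
import numpy as np

def motion_aware_mask(pose_emb, motion_scores, mask_ratio=0.5, seed=0):
    """Hypothetical illustration: mask a subset of pose embeddings,
    biased toward frames with large motion, so a sampler would have to
    reconstruct the hardest (most dynamic) frames from temporal context.

    pose_emb:      (T, D) continuous pose embeddings, one row per frame.
    motion_scores: (T,) nonnegative per-frame motion magnitudes.
    """
    rng = np.random.default_rng(seed)
    T, _ = pose_emb.shape
    # Sample frames to mask with probability proportional to motion.
    probs = motion_scores / motion_scores.sum()
    n_mask = int(mask_ratio * T)
    masked_idx = rng.choice(T, size=n_mask, replace=False, p=probs)
    mask = np.zeros(T, dtype=bool)
    mask[masked_idx] = True
    masked = pose_emb.copy()
    masked[mask] = 0.0  # zero vector stands in for a learned mask token
    return masked, mask

# Toy example: 8 frames, 4-dim embeddings, motion peaking mid-sequence.
T, D = 8, 4
pose = np.arange(T * D, dtype=float).reshape(T, D)
motion = np.array([0.1, 0.2, 1.0, 1.0, 1.0, 0.5, 0.2, 0.1])
masked, mask = motion_aware_mask(pose, motion, mask_ratio=0.5)
```

In a real training loop, the denoiser would be asked to recover the masked pose embeddings from the visible ones, which encourages temporally coherent motion prediction; the zero-token replacement above is only a stand-in for whatever masking mechanism the model actually uses.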