Generating text-editable and pose-controllable character videos is in pressing demand for creating diverse digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset of paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to generate pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used for controllable text-to-image generation: we learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we learn the motion of the above network by fine-tuning it on a pose-free video dataset, adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept-composition abilities of the pre-trained T2I model. The code and models will be made publicly available.
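The two architectural ingredients named above are a zero-initialized convolutional pose encoder (stage one) and learnable temporal self-attention inserted into the frozen spatial network (stage two). Below is a minimal PyTorch sketch of one plausible reading of these components; it is not the authors' released code, and the channel sizes, class names, and the zero-initialized output projection in the temporal block are assumptions made for illustration.

```python
# Illustrative sketch only: a zero-initialized pose encoder and a temporal
# self-attention block, as one might realize the components described in the
# abstract. All dimensions and names are assumed, not taken from the paper.
import torch
import torch.nn as nn


class ZeroInitPoseEncoder(nn.Module):
    """Encodes a rendered keypoint map into a residual feature for the T2I U-Net.

    The final 1x1 conv is zero-initialized, so at the start of stage one the
    encoder contributes nothing and the pre-trained T2I prior is untouched.
    """

    def __init__(self, in_channels: int = 3, out_channels: int = 320):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.zero_conv = nn.Conv2d(256, out_channels, 1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, pose_map: torch.Tensor) -> torch.Tensor:
        return self.zero_conv(self.body(pose_map))


class TemporalSelfAttention(nn.Module):
    """Self-attention over the frame axis only, added to a frozen spatial model."""

    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Zero-init the output projection so the block is an identity mapping at
        # the start of stage two (a common choice; an assumption here).
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, dim); attend across frames per location.
        b, f, hw, d = x.shape
        seq = self.norm(x).permute(0, 2, 1, 3).reshape(b * hw, f, d)
        out, _ = self.attn(seq, seq, seq)
        out = out.reshape(b, hw, f, d).permute(0, 2, 1, 3)
        return x + out  # residual around the temporal attention


if __name__ == "__main__":
    pose = torch.randn(2, 3, 64, 64)                # rendered keypoint maps
    print(ZeroInitPoseEncoder()(pose).abs().sum())  # 0 at init: prior preserved
    feats = torch.randn(2, 8, 64, 320)              # (batch, frames, tokens, dim)
    print(TemporalSelfAttention()(feats).shape)     # same shape as the input
```

In a training loop following this sketch, only the pose encoder would be updated in stage one, and only the newly added temporal and cross-frame attention blocks in stage two, keeping the pre-trained T2I weights frozen so its editing ability is preserved.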