MoCoGAN: Decomposing Motion and Content for Video Generation

October 21, 2017 · CreateAMind

MoCoGAN: Decomposing Motion and Content for Video Generation

This repository contains an implementation and further details of MoCoGAN: Decomposing Motion and Content for Video Generation by Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz.


MoCoGAN is a generative model for videos, which generates videos from random inputs. It features separated representations of motion and content, offering control over what is generated. For example, MoCoGAN can generate the same object performing different actions, as well as the same action performed by different objects.
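The separation can be pictured as one latent code shared by every frame (content) plus a fresh code per frame (motion). A minimal sketch of that split follows; the dimensions and the i.i.d. motion sampling are assumptions for illustration (the actual model maps noise through a recurrent network and a learned image generator):

```python
import numpy as np

rng = np.random.default_rng(0)

DIM_CONTENT = 8   # hypothetical size of the content code, fixed per clip
DIM_MOTION = 4    # hypothetical size of the per-frame motion code

def sample_video_latents(num_frames, z_content=None, rng=rng):
    """Build per-frame latent codes: one content code shared by the whole
    clip, plus a new motion code for every frame."""
    if z_content is None:
        z_content = rng.standard_normal(DIM_CONTENT)
    frames = []
    for _ in range(num_frames):
        z_motion = rng.standard_normal(DIM_MOTION)
        frames.append(np.concatenate([z_content, z_motion]))
    return np.stack(frames)  # shape: (num_frames, DIM_CONTENT + DIM_MOTION)

latents = sample_video_latents(16)
```

Reusing the same `z_content` with new motion codes corresponds to "the same object performing different actions"; reusing the motion codes with a new `z_content` gives "different objects performing the same action".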

Examples of generated videos

We trained MoCoGAN on the MUG Facial Expression Database to generate facial expressions. When fixing the content code and changing the motion code, it generated the same person performing different expressions. When fixing the motion code and changing the content code, it generated different people performing the same expression. In the figure shown below, each column has a fixed identity and each row shows the same action:

We trained MoCoGAN on a human action dataset where content is represented by the performer, executing several actions. When fixing the content code and changing the motion code, it generated the same person performing different actions. When fixing the motion code and changing the content code, it generated different people performing the same action. Each pair of images represents the same action executed by different people:

We have collected a large-scale TaiChi dataset containing 4.5K videos of Tai Chi performers. Below are videos generated by MoCoGAN.

Training MoCoGAN

Please refer to the wiki page.


If you use MoCoGAN in your research, please cite our paper:

Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz, "MoCoGAN: Decomposing Motion and Content for Video Generation"



Ming-Yu Liu is a distinguished research scientist at NVIDIA. Before joining NVIDIA in 2016, he was a principal research scientist at Mitsubishi Electric Research Laboratories (MERL). He received his Ph.D. from the Department of Electrical and Computer Engineering at the University of Maryland, College Park in 2012. In 2014, his robotic bin-picking system won an R&D 100 Award from R&D Magazine. His papers on semantic image synthesis and on scene understanding were best-paper finalists at CVPR 2019 and RSS 2015, respectively. At SIGGRAPH 2019, his image synthesis work won the Best in Show and Audience Choice awards at the Real-Time Live show. His research focuses on generative image modeling, with the goal of giving machines human-like imagination.

Despite the success of Generative Adversarial Networks (GANs), mode collapse remains a serious issue during GAN training. To date, little work has focused on understanding and quantifying which modes have been dropped by a model. In this work, we visualize mode collapse at both the distribution level and the instance level. First, we deploy a semantic segmentation network to compare the distribution of segmented objects in the generated images with the target distribution in the training set. Differences in statistics reveal object classes that are omitted by a GAN. Second, given the identified omitted object classes, we visualize the GAN's omissions directly. In particular, we compare specific differences between individual photos and their approximate inversions by a GAN. To this end, we relax the problem of inversion and solve the tractable problem of inverting a GAN layer instead of the entire generator. Finally, we use this framework to analyze several recent GANs trained on multiple datasets and identify their typical failure cases.
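The distribution-level step above amounts to comparing per-class statistics of segmented objects between training and generated images. A toy sketch of that comparison, with a hypothetical frequency measure (fraction of images in which a class appears) and an assumed drop threshold:

```python
from collections import Counter

def class_frequencies(segment_labels, classes):
    """Fraction of images in which each object class appears at least once.
    `segment_labels` is a list of per-image label lists, e.g. the classes
    a segmentation network found in each image."""
    counts = Counter()
    for labels in segment_labels:
        for c in set(labels):
            counts[c] += 1
    n = len(segment_labels)
    return {c: counts[c] / n for c in classes}

def omitted_classes(train_segs, gen_segs, classes, drop_ratio=0.5):
    """Flag classes whose frequency in generated images falls below
    drop_ratio times their frequency in the training set."""
    f_train = class_frequencies(train_segs, classes)
    f_gen = class_frequencies(gen_segs, classes)
    return [c for c in classes
            if f_train[c] > 0 and f_gen[c] < drop_ratio * f_train[c]]
```

A class flagged here (e.g. "person" never appearing in generated scenes that should contain people) is a candidate dropped mode, which the instance-level inversion step then visualizes directly.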


We present the first method to capture the 3D total motion of a target person from a monocular view input. Given an image or a monocular video, our method reconstructs the motion from body, face, and fingers represented by a 3D deformable mesh model. We use an efficient representation called 3D Part Orientation Fields (POFs), to encode the 3D orientations of all body parts in the common 2D image space. POFs are predicted by a Fully Convolutional Network (FCN), along with the joint confidence maps. To train our network, we collect a new 3D human motion dataset capturing diverse total body motion of 40 subjects in a multiview system. We leverage a 3D deformable human model to reconstruct total body pose from the CNN outputs by exploiting the pose and shape prior in the model. We also present a texture-based tracking method to obtain temporally coherent motion capture output. We perform thorough quantitative evaluations including comparison with the existing body-specific and hand-specific methods, and performance analysis on camera viewpoint and human pose changes. Finally, we demonstrate the results of our total body motion capture on various challenging in-the-wild videos. Our code and newly collected human motion dataset will be publicly shared.
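The core encoding idea is that each body part's 3D orientation (a unit vector) is written into the 2D image pixels covering that part. A simplified sketch of such an orientation-field target, with assumed names and a naive point-to-segment rasterizer (the paper's POFs are predicted by an FCN, not constructed this way at test time):

```python
import numpy as np

def part_orientation(joint_a, joint_b):
    """3D unit vector pointing from a parent joint to its child joint --
    the per-part value an orientation-field channel would regress."""
    v = np.asarray(joint_b, float) - np.asarray(joint_a, float)
    return v / np.linalg.norm(v)

def rasterize_pof(joint_a_2d, joint_b_2d, orientation, shape, radius=1.0):
    """Write the 3-vector `orientation` into every pixel near the 2D
    segment between the projected joints; all other pixels stay zero."""
    h, w = shape
    pof = np.zeros((h, w, 3))
    a = np.asarray(joint_a_2d, float)
    b = np.asarray(joint_b_2d, float)
    ab = b - a
    denom = ab @ ab
    for y in range(h):
        for x in range(w):
            p = np.array([x, y], float)
            t = 0.0 if denom == 0 else np.clip((p - a) @ ab / denom, 0.0, 1.0)
            if np.linalg.norm(p - (a + t * ab)) <= radius:
                pof[y, x] = orientation
    return pof
```

This makes concrete why the representation lives "in the common 2D image space": a network can predict it with the same fully convolutional machinery used for joint confidence maps.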


We present DeblurGAN, an end-to-end learned method for motion deblurring. The learning is based on a conditional GAN and a content loss. DeblurGAN achieves state-of-the-art performance in both the structural similarity measure and visual appearance. The quality of the deblurring model is also evaluated in a novel way on a real-world problem -- object detection on (de-)blurred images. The method is 5 times faster than the closest competitor -- DeepDeblur. We also introduce a novel method for generating synthetic motion-blurred images from sharp ones, allowing realistic dataset augmentation. The model, code and the dataset are available at
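The objective combines an adversarial term with the content term. A hedged sketch of that combined generator loss, with an assumed weighting constant and plain arrays standing in for critic scores and image feature maps:

```python
import numpy as np

LAMBDA_CONTENT = 100.0  # relative weight of the content term (assumed value)

def content_loss(feat_restored, feat_sharp):
    """Mean squared distance between feature maps of the restored and the
    sharp image -- a perceptual-style content loss."""
    diff = np.asarray(feat_restored, float) - np.asarray(feat_sharp, float)
    return float(np.mean(diff ** 2))

def generator_loss(critic_scores_fake, feat_restored, feat_sharp):
    """Generator objective of the conditional GAN: raise the critic's
    score on restored images while staying close to the sharp image
    in feature space."""
    adv = -float(np.mean(critic_scores_fake))  # Wasserstein-style critic term
    return adv + LAMBDA_CONTENT * content_loss(feat_restored, feat_sharp)
```

The content term is what keeps the restored image faithful to the sharp target; the adversarial term alone would only make outputs look plausible, not correct.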


We present FusedGAN, a deep network for conditional image synthesis with controllable sampling of diverse images. Fidelity, diversity and controllable sampling are the main quality measures of a good image generation model. Most existing models fall short in at least one of these three aspects. FusedGAN can perform controllable sampling of diverse images with very high fidelity. We argue that controllability can be achieved by disentangling the generation process into various stages. In contrast to stacked GANs, where multiple stages of GANs are trained separately with full supervision of labeled intermediate images, FusedGAN has a single-stage pipeline with a built-in stacking of GANs. Unlike existing methods, which require full supervision with paired conditions and images, FusedGAN can effectively leverage more abundant images without corresponding conditions during training to produce more diverse samples with high fidelity. We achieve this by fusing two generators: one for unconditional image generation and the other for conditional image generation, where the two partly share a common latent space, thereby disentangling the generation. We demonstrate the efficacy of FusedGAN on fine-grained image generation tasks such as text-to-image and attribute-to-face generation.
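The fusion can be pictured as two generators that share an early stage: the unconditional path needs only images, while the conditional path injects the condition after the shared stage. A toy sketch with assumed dimensions and linear maps standing in for the learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)
Z, M, C, X = 4, 6, 3, 8                   # latent, shared, condition, image dims
W_shared = rng.standard_normal((M, Z))    # stage shared by both generators
W_uncond = rng.standard_normal((X, M))    # unconditional branch
W_cond = rng.standard_normal((X, M + C))  # conditional branch

def g_uncond(z):
    """Unconditional path: trainable on abundant images with no labels."""
    m = np.tanh(W_shared @ z)             # shared structure code
    return W_uncond @ m

def g_cond(z, cond):
    """Conditional path: reuses the same shared structure code, then
    injects the condition (e.g. a text or attribute embedding)."""
    m = np.tanh(W_shared @ z)
    return W_cond @ np.concatenate([m, cond])
```

Because `W_shared` receives gradients from both branches, unlabeled images still improve the representation the conditional generator builds on, which is the leverage the abstract describes.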

GAN image generation at 1024²: code and paper
4+ reads · Oct 31, 2017
Auto-Encoding GAN
5+ reads · Aug 4, 2017
Seeing What a GAN Cannot Generate
David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, Antonio Torralba
6+ reads · Oct 24, 2019
Two-phase Hair Image Synthesis by Self-Enhancing Generative Model
Haonan Qiu, Chuan Wang, Hang Zhu, Xiangyu Zhu, Jinjin Gu, Xiaoguang Han
3+ reads · Feb 28, 2019
Monocular Total Capture: Posing Face, Body, and Hands in the Wild
Donglai Xiang, Hanbyul Joo, Yaser Sheikh
4+ reads · Dec 4, 2018
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, Bryan Catanzaro
3+ reads · Aug 20, 2018
Huiting Hong, Xin Li, Mingzhong Wang
4+ reads · May 21, 2018
Xuelin Qian, Yanwei Fu, Tao Xiang, Wenxuan Wang, Jie Qiu, Yang Wu, Yu-Gang Jiang, Xiangyang Xue
5+ reads · Apr 25, 2018
Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, Jiri Matas
3+ reads · Apr 3, 2018
Xinlei Chen, Li-Jia Li, Li Fei-Fei, Abhinav Gupta
3+ reads · Mar 29, 2018
You Xie, Erik Franz, Mengyu Chu, Nils Thuerey
5+ reads · Jan 29, 2018
Navaneeth Bodla, Gang Hua, Rama Chellappa
8+ reads · Jan 17, 2018