Disentangled representations support a range of downstream tasks including causal reasoning, generative modeling, and fair machine learning. Unfortunately, disentanglement has been shown to be impossible without the incorporation of supervision or inductive bias. Given that supervision is often expensive or infeasible to acquire, we choose to incorporate structural inductive bias and present an unsupervised, deep State-Space-Model for Video Disentanglement (VDSM). The model disentangles latent time-invariant and dynamic factors by incorporating hierarchical structure with a dynamic prior and a Mixture of Experts decoder. VDSM learns separate disentangled representations for the identity of the object or person in the video, and for the action being performed. We evaluate VDSM across a range of qualitative and quantitative tasks including identity and dynamics transfer, sequence generation, Fr\'echet Inception Distance, and factor classification. VDSM provides state-of-the-art performance, exceeding adversarial methods even when those methods use additional supervision.
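To make the described structure concrete, the sketch below shows one plausible wiring of the two components the abstract names: a learned Gaussian transition network acting as the dynamic prior over per-frame latents, and a Mixture of Experts decoder gated by a static identity code. This is a minimal illustrative sketch in PyTorch, not the authors' implementation; all names, dimensions, and the number of experts are assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's code): a static
# identity code gates a bank of expert decoders, while a learned transition
# network provides the dynamic prior p(z_t | z_{t-1}) over per-frame latents.
import torch
import torch.nn as nn


class DynamicPrior(nn.Module):
    """Gaussian transition p(z_t | z_{t-1}) parameterised by an MLP."""

    def __init__(self, z_dim=32, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, z_prev):
        mu, log_var = self.net(z_prev).chunk(2, dim=-1)
        return mu, log_var


class MoEDecoder(nn.Module):
    """Mixture of Experts decoder: identity logits weight per-expert frames."""

    def __init__(self, z_dim=32, n_experts=8, frame_dim=64 * 64):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                          nn.Linear(256, frame_dim))
            for _ in range(n_experts))

    def forward(self, z_t, id_logits):
        weights = torch.softmax(id_logits, dim=-1)          # (B, n_experts)
        frames = torch.stack([e(z_t) for e in self.experts], dim=1)
        return (weights.unsqueeze(-1) * frames).sum(dim=1)  # (B, frame_dim)


# Generate a short sequence by rolling the dynamic prior forward: the
# identity code stays fixed across time, only the dynamic latent evolves.
prior, decoder = DynamicPrior(), MoEDecoder()
id_logits = torch.randn(4, 8)   # one static identity code per video
z = torch.zeros(4, 32)          # initial dynamic state
for t in range(16):
    mu, log_var = prior(z)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterised sample
    frame = decoder(z, id_logits)                           # (4, 4096) flat frames
```

Keeping the identity code out of the transition network is what makes identity and dynamics transfer possible in this kind of design: swapping `id_logits` between sequences changes who appears, while reusing the rolled-out latents preserves the action.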