Learning generative object models from unlabelled videos is a long-standing problem and is required for causal scene modeling. We decompose this problem into three easier subtasks and provide candidate solutions for each of them. Inspired by the Common Fate Principle of Gestalt Psychology, we first extract (noisy) masks of moving objects via unsupervised motion segmentation. Second, generative models are trained on the masks of the background and the moving objects, respectively. Third, the background and foreground models are combined in a conditional "dead leaves" scene model to sample novel scene configurations in which occlusions and depth layering arise naturally. To evaluate the individual stages, we introduce the Fishbowl dataset, positioned between complex real-world scenes and common object-centric benchmarks of simplistic objects. We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos and that represent scenes in a modular fashion, enabling the sampling of plausible scenes outside the training distribution, for instance with object counts or densities not observed in the training set.
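To make the third stage concrete, the following is a minimal sketch of dead-leaves compositing: a sampled background is overlaid with independently sampled foreground objects, and pasting later samples over earlier ones is what makes occlusion and depth layering emerge from the compositing order alone. The samplers `sample_background` and `sample_object`, the uniform random placement, and the fixed object count are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sample_dead_leaves_scene(sample_background, sample_object, n_objects, rng=None):
    """Compose a scene by pasting sampled objects one after another.

    Objects pasted later land on top of earlier ones, so occlusions and
    an implicit depth ordering arise purely from the compositing order.
    Assumes each sampled object crop fits inside the background frame.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Hypothetical background model: returns an HxWx3 image.
    scene = sample_background().astype(np.float32)
    h, w, _ = scene.shape
    for _ in range(n_objects):
        # Hypothetical foreground model: returns an appearance crop
        # (oh x ow x 3) and its binary mask (oh x ow).
        appearance, mask = sample_object()
        oh, ow = mask.shape
        # Uniform random placement of the object within the frame.
        top = int(rng.integers(0, h - oh + 1))
        left = int(rng.integers(0, w - ow + 1))
        region = scene[top:top + oh, left:left + ow]
        m = mask[..., None].astype(np.float32)  # broadcast over RGB
        # Masked paste: object pixels replace whatever lies beneath.
        region[:] = m * appearance + (1.0 - m) * region
    return scene
```

Because the scene is assembled object by object, the same samplers can be reused with object counts or densities never seen during training, which is the modularity the abstract refers to.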