We construct an unsupervised learning model that achieves nonlinear disentanglement of underlying factors of variation in naturalistic videos. Previous work suggests that representations can be disentangled if all but a few factors in the environment stay constant at any point in time. As a result, algorithms proposed for this problem have only been tested on carefully constructed datasets with this exact property, leaving it unclear whether they transfer to natural scenes. Here we provide evidence that objects in segmented natural movies undergo transitions that are typically small in magnitude, with occasional large jumps, which is characteristic of a temporally sparse distribution. We leverage this finding and present SlowVAE, a model for unsupervised representation learning that uses a sparse prior on temporally adjacent observations to disentangle generative factors without any assumptions on the number of changing factors. We provide a proof of identifiability and show that the model reliably learns disentangled representations on several established benchmark datasets, often surpassing the current state of the art. We additionally demonstrate transferability to video datasets with natural dynamics, Natural Sprites and KITTI Masks, which we contribute as benchmarks for guiding disentanglement research towards more natural data domains.
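To make the core idea concrete, the sketch below shows one plausible way a sparse prior on temporally adjacent latents could enter a VAE training objective: the posterior of frame t+1 is regularized towards a Laplace distribution centred on the latent of frame t, so that most transitions stay small while occasional large jumps remain cheap. This is a minimal illustrative sketch, not the released SlowVAE implementation; the `encoder`/`decoder` interfaces, the hyperparameters `rate` and `beta`, and the single-sample Monte-Carlo KL estimate are all assumptions made here for brevity.

```python
# Minimal sketch (assumed interfaces, not the authors' code) of a VAE loss
# with a sparse Laplace prior on temporally adjacent latents.
import torch
import torch.nn.functional as F
from torch.distributions import Normal, Laplace, kl_divergence


def slowvae_style_loss(encoder, decoder, x_t, x_t1, rate=6.0, beta=1.0):
    """One training step on a pair of temporally adjacent frames.

    `encoder(x) -> (mu, logvar)` and `decoder(z) -> x_hat` are hypothetical
    modules; `rate` is the Laplace rate on latent transitions (assumed value).
    """
    mu_t, logvar_t = encoder(x_t)
    mu_t1, logvar_t1 = encoder(x_t1)

    q_t = Normal(mu_t, (0.5 * logvar_t).exp())
    q_t1 = Normal(mu_t1, (0.5 * logvar_t1).exp())

    z_t, z_t1 = q_t.rsample(), q_t1.rsample()

    # Reconstruction of both frames.
    recon = (F.mse_loss(decoder(z_t), x_t, reduction="sum")
             + F.mse_loss(decoder(z_t1), x_t1, reduction="sum"))

    # Ordinary VAE term: standard-normal KL on the first frame.
    kl_prior = kl_divergence(q_t, Normal(0.0, 1.0)).sum()

    # Sparse temporal term: single-sample Monte-Carlo estimate of the KL
    # between the posterior of frame t+1 and a Laplace transition prior
    # centred on z_t (detached here for simplicity).
    p_trans = Laplace(z_t.detach(), 1.0 / rate)
    kl_trans = (q_t1.log_prob(z_t1) - p_trans.log_prob(z_t1)).sum()

    return recon + beta * (kl_prior + kl_trans)
```

The Laplace scale 1/rate encodes the temporal sparsity the abstract describes: its heavy tails penalize many small latent changes less per unit mass than a Gaussian would, while still permitting the occasional large jump observed in segmented natural movies.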