We propose a new representation of visual data that disentangles object position from appearance. Our method, termed Deep Latent Particles (DLP), decomposes the visual input into low-dimensional latent ``particles'', where each particle is described by its spatial location and features of its surrounding region. To drive learning of such representations, we follow a VAE-based approach and introduce a prior for particle positions based on a spatial-softmax architecture, and a modification of the evidence lower bound loss inspired by the Chamfer distance between particles. We demonstrate that our DLP representations are useful for downstream tasks such as unsupervised keypoint (KP) detection, image manipulation, and video prediction for scenes composed of multiple dynamic objects. In addition, we show that our probabilistic interpretation of the problem naturally provides uncertainty estimates for particle locations, which can be used for model selection, among other tasks. Videos and code are available: https://taldatech.github.io/deep-latent-particles-web/
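The abstract mentions an ELBO modification inspired by the Chamfer distance between particles. As a point of reference (not the authors' implementation), a minimal NumPy sketch of the symmetric Chamfer distance between two sets of 2D particle positions might look like:

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between two particle sets.

    p: array of shape (N, 2), q: array of shape (M, 2),
    each row a 2D particle position. Illustrative sketch only.
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    d = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)
    # Average nearest-neighbor distance in both directions,
    # so the measure is symmetric in p and q.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Unlike a per-index matching loss, this measure is invariant to the ordering of particles within each set, which is the property that makes it a natural fit for comparing unordered sets of latent particles.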