Understanding which inductive biases could be helpful for the unsupervised learning of object-centric representations of natural scenes is challenging. In this paper, we systematically investigate the performance of two models on datasets where neural style transfer was used to obtain objects with complex textures while still retaining ground-truth annotations. We find that a model using a single module to reconstruct both the shape and visual appearance of each object learns more useful representations and achieves better object separation. In addition, we observe that adjusting the latent space size is insufficient to improve segmentation performance. Finally, the downstream usefulness of the representations correlates significantly more strongly with segmentation quality than with reconstruction accuracy.