Understanding which inductive biases could be helpful for the unsupervised learning of object-centric representations of natural scenes is challenging. We use neural style transfer to generate datasets where objects have complex textures while still retaining ground-truth annotations. We find that methods that use a single module to reconstruct both the shape and visual appearance of each object learn more useful representations and achieve better object separation. In addition, we observe that adjusting the latent space size is not sufficient to improve segmentation performance. Finally, the downstream usefulness of the representations is significantly more strongly correlated with segmentation quality than with reconstruction accuracy.