Biological vision systems are unparalleled in their ability to learn visual representations without supervision. In machine learning, self-supervised learning (SSL) has led to major advances in forming object representations in an unsupervised fashion. Such systems learn representations invariant to augmentation operations over images, like cropping or flipping. In contrast, biological vision systems exploit the temporal structure of the visual experience during natural interactions with objects. This gives access to "augmentations" not commonly used in SSL, like watching the same object from multiple viewpoints or against different backgrounds. Here, we systematically investigate and compare the potential benefits of such time-based augmentations during natural interactions for learning object categories. Our results show that time-based augmentations achieve large performance gains over state-of-the-art image augmentations. Specifically, our analyses reveal that: 1) 3-D object manipulations drastically improve the learning of object categories; 2) viewing objects against changing backgrounds is important for learning to discard background-related information from the latent representation. Overall, we conclude that time-based augmentations during natural interactions with objects can substantially improve self-supervised learning, narrowing the gap between artificial and biological vision systems.
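To make the contrast with image augmentations concrete, the following is a minimal sketch of how a time-based positive pair could be formed: instead of pairing two augmented crops of one image, each frame is paired with a temporally close frame from the same interaction episode, and a generic InfoNCE-style contrastive loss pulls the pair together. The function names (`sample_temporal_pairs`, `info_nce`), the `max_gap` and `temperature` parameters, and the choice of InfoNCE are illustrative assumptions, not the objective or sampling scheme used in this work.

```python
import numpy as np

def sample_temporal_pairs(episode_ids, rng, max_gap=1):
    """For each frame i, pick a positive partner at most `max_gap` steps away
    in time that belongs to the same episode (hypothetical sampling rule)."""
    episode_ids = np.asarray(episode_ids)
    n = len(episode_ids)
    partners = np.empty(n, dtype=int)
    for i in range(n):
        lo, hi = max(0, i - max_gap), min(n - 1, i + max_gap)
        candidates = [j for j in range(lo, hi + 1)
                      if j != i and episode_ids[j] == episode_ids[i]]
        # Fall back to the frame itself if the episode has no neighbouring frame.
        partners[i] = rng.choice(candidates) if candidates else i
    return partners

def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE loss: row i of z_a should match row i of z_b,
    with all other rows of z_b treated as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Usage: random vectors stand in for the encoder outputs of five frames
# drawn from two episodes (frames 0-2 and frames 3-4).
rng = np.random.default_rng(0)
episode_ids = [0, 0, 0, 1, 1]
embeddings = rng.normal(size=(5, 16))
partners = sample_temporal_pairs(episode_ids, rng)
loss = info_nce(embeddings, embeddings[partners])
```

In this sketch, frames from the same episode can still appear as negatives for one another within a batch; a practical implementation would need to decide how to handle that.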