Humans can identify objects across various spatial transformations, such as changes in scale and viewpoint. This ability extends to novel objects after a single presentation at a single pose, a capacity sometimes referred to as online invariance. CNNs have been proposed as a compelling model of human vision, but their ability to identify objects across transformations is typically tested on held-out samples of trained categories after extensive data augmentation. This paper assesses whether standard CNNs can support human-like online invariance by training models to recognize images of synthetic 3D objects that undergo several transformations: rotation, scaling, translation, brightness, contrast, and viewpoint. Through an analysis of the models' internal representations, we show that standard supervised CNNs trained on transformed objects can acquire strong invariances to novel classes even when trained with as few as 50 objects taken from 10 classes. This result extends to a different dataset of photographs of real objects. We also show that these invariances can be acquired in a self-supervised way, by solving a same/different task, and we suggest that this latter approach may resemble how humans acquire invariances.
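As an informal illustration of how invariance in internal representations can be quantified, the minimal Python sketch below compares a network's embedding of a novel object at a base pose with its embeddings of transformed views of the same object versus views of other objects. The `SmallCNN` architecture and the `invariance_score` function are hypothetical stand-ins, not the paper's actual models or analysis code; the idea is simply that embeddings of the same object under transformation should be closer than embeddings of different objects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical minimal CNN; a stand-in for the trained models in the paper.
class SmallCNN(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        return self.features(x)

def invariance_score(model, base, transformed, distractors):
    """Mean cosine similarity of a novel object's embedding across its
    transformed views, minus its similarity to other objects.
    Higher values indicate stronger invariance."""
    with torch.no_grad():
        z_base = F.normalize(model(base), dim=-1)          # (1, D)
        z_trans = F.normalize(model(transformed), dim=-1)  # (T, D)
        z_other = F.normalize(model(distractors), dim=-1)  # (K, D)
    same = (z_base @ z_trans.T).mean()   # same object, transformed views
    diff = (z_base @ z_other.T).mean()   # different objects
    return (same - diff).item()

# Toy usage with random tensors standing in for rendered 3D-object images.
model = SmallCNN().eval()
base = torch.randn(1, 3, 64, 64)
transformed = torch.randn(5, 3, 64, 64)   # e.g. rotated/scaled/translated views
distractors = torch.randn(8, 3, 64, 64)
print(invariance_score(model, base, transformed, distractors))
```

In practice one would replace the random tensors with rendered views of held-out objects and read embeddings from a chosen internal layer of the trained network.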