This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like", natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is on the order of a million years of natural visual experience, in other words several orders of magnitude longer than a human lifetime. However, this estimate is quite sensitive to some underlying assumptions, underscoring the need to run carefully controlled human experiments. We discuss the main caveats surrounding our estimate and the implications of this rather surprising result.
翻译:本文涉及一个根本问题:我们目前自我监督的视觉表现学习算法相对于人类来说有多好?更具体地说,这些算法需要多少“像人一样”的自然视觉经验才能在像图像网络这样的复杂、现实的视觉物体识别任务中达到人的水平性表现?我们用一个缩放实验估计答案是大约100万年的自然视觉经验,换句话说,比人类寿命长的几个数量级。然而,这一估计对于一些基本假设相当敏感,这突出说明了进行谨慎控制的人类实验的必要性。我们讨论了关于我们的估计的主要警告以及这一相当令人惊讶的结果的影响。