In recent years, advances in deep learning models for computer vision have led to dramatic improvements in image classification accuracy. However, models that achieve higher accuracy on the task they were trained on do not necessarily develop better image representations that also transfer to tasks they were not trained on. To investigate the representation learning capabilities of prominent high-performing computer vision models, we examined how well they capture various indices of perceptual similarity from large-scale behavioral datasets. We find that higher image classification accuracy is not associated with better performance on these datasets; in fact, we observe no improvement in performance since GoogLeNet (released 2015) and VGG-M (released 2014). We speculate that gains in classification accuracy may result from hyper-engineering toward very fine-grained distinctions between highly similar classes, which does not incentivize models to capture overall perceptual similarities.
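As a minimal sketch of the kind of evaluation described above (the paper's actual datasets, models, layers, and metrics are not specified here and may differ), one way to test whether a classifier's representations track human perceptual similarity is to correlate pairwise similarities between its embeddings with human similarity judgments. The GoogLeNet penultimate layer, cosine similarity, and Spearman correlation below are illustrative assumptions, and the inputs are stand-ins for real data.

```python
# Illustrative sketch, not the authors' code: correlate a pretrained
# classifier's embedding similarities with human similarity ratings.
import numpy as np
import torch
import torchvision.models as models
from scipy.stats import spearmanr

# Hypothetical inputs: N preprocessed images and an N x N matrix of
# human pairwise similarity judgments from a behavioral dataset.
N = 8
images = torch.randn(N, 3, 224, 224)       # stand-in for real images
human_sim = np.random.rand(N, N)           # stand-in for human ratings
human_sim = (human_sim + human_sim.T) / 2  # symmetrize

# Use the penultimate layer of a pretrained classifier as the representation.
model = models.googlenet(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()             # drop the classification head
model.eval()

with torch.no_grad():
    emb = model(images)                    # (N, feature_dim) embeddings

# Pairwise cosine similarity between embeddings.
emb = torch.nn.functional.normalize(emb, dim=1)
model_sim = (emb @ emb.T).numpy()

# Compare model and human similarity over the unique off-diagonal pairs.
iu = np.triu_indices(N, k=1)
rho, p = spearmanr(model_sim[iu], human_sim[iu])
print(f"Spearman rho between model and human similarity: {rho:.3f} (p={p:.3g})")
```

Under this setup, a model whose embeddings capture perceptual structure well would yield a high rank correlation with the human judgments, and the comparison can be repeated across models to test whether classification accuracy predicts that correlation.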