Deep-learning vision models show intriguing similarities to and differences from human vision. We investigate how to bring machine visual representations into better alignment with human representations. Human representations are often inferred from behavioral evidence, such as the selection of the image most similar to a query image. We find that, with appropriate linear transformations of deep embeddings, we can improve prediction of human binary choice on a data set of bird images from 72% at baseline to 89%. We hypothesized that the high-dimensional (4096) deep embeddings are redundant; however, reducing the rank of these representations results in a loss of explanatory power. We also hypothesized that the dilation transformation of representations explored in past research is too restrictive, and indeed we found that model explanatory power can be significantly improved with a more expressive linear transform. Most surprisingly, we found that, consistent with the classic psychological literature, human similarity judgments are asymmetric: the similarity of X to Y is not necessarily equal to the similarity of Y to X, and allowing models to express this asymmetry improves explanatory power.
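The contrast between a dilation transform, a more expressive linear transform, and an asymmetric similarity can be illustrated with a small sketch. This is not the authors' model: the function names, the bilinear form `x @ A @ y`, the logistic choice rule, and the toy dimensionality are all assumptions made for illustration. A dilation is taken here to mean a per-dimension rescaling, which corresponds to a diagonal `A` and is therefore always symmetric; a general matrix `A` need not be.

```python
import numpy as np


def dilation_similarity(x, y, w):
    """Dot product after per-dimension rescaling: sum_i w_i * x_i * y_i."""
    return float(np.sum(w * x * y))


def bilinear_similarity(x, y, A):
    """General bilinear form x^T A y; asymmetric whenever A != A^T."""
    return float(x @ A @ y)


def choice_probability(query, cand_a, cand_b, A):
    """Probability of picking cand_a over cand_b for a query.

    Assumed choice rule: logistic function of the similarity difference.
    """
    diff = bilinear_similarity(query, cand_a, A) - bilinear_similarity(query, cand_b, A)
    return 1.0 / (1.0 + np.exp(-diff))


rng = np.random.default_rng(0)
d = 8  # toy size for illustration; the abstract's embeddings are 4096-dimensional
x, y, z = rng.normal(size=(3, d))  # stand-ins for three image embeddings

w = rng.uniform(0.5, 1.5, size=d)  # hypothetical dilation weights
A_sym = np.diag(w)                 # dilation written as a diagonal matrix
A_asym = rng.normal(size=(d, d))   # unconstrained matrix, not symmetric

sim_xy = bilinear_similarity(x, y, A_asym)
sim_yx = bilinear_similarity(y, x, A_asym)
p_y_over_z = choice_probability(x, y, z, A_asym)
```

In a real setting `A` would be fit to human choice data rather than drawn at random; the sketch only shows why the diagonal form cannot express "the similarity of X to Y differs from the similarity of Y to X" while the unconstrained form can.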