Recently introduced self-supervised methods for image representation learning achieve results on par with or superior to their fully supervised counterparts, yet the corresponding efforts to explain self-supervised approaches lag behind. Motivated by this observation, we introduce a novel visual probing framework for explaining self-supervised models by leveraging probing tasks previously employed in natural language processing. The probing tasks require knowledge of semantic relationships between image parts. Hence, we propose a systematic approach to obtaining analogs of natural language in vision, such as visual words, context, and taxonomy. Our proposal is grounded in Marr's computational theory of vision and concerns features such as textures, shapes, and lines. We show the effectiveness and applicability of these analogs in the context of explaining self-supervised representations. Our key findings emphasize that relations between language and vision can serve as an effective yet intuitive tool for discovering how machine learning models work, independently of data modality. Our work opens a plethora of research pathways towards more explainable and transparent AI.
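To make the notion of "visual words" concrete, the sketch below illustrates one common way such a vocabulary can be constructed: clustering patch-level embeddings from a frozen self-supervised encoder into discrete tokens. This is a minimal, hypothetical illustration under that assumption, not the paper's exact procedure; the function names and parameters are placeholders.

```python
# Minimal sketch (assumption, not the paper's exact method): build a "visual word"
# vocabulary by k-means clustering of patch-level embeddings from a frozen
# self-supervised encoder, then assign each patch a discrete word id.
import numpy as np
from sklearn.cluster import KMeans


def build_visual_vocabulary(patch_embeddings: np.ndarray, num_words: int = 512) -> KMeans:
    """Cluster patch embeddings into a vocabulary of `num_words` visual words.

    patch_embeddings: array of shape (num_patches, embed_dim), e.g. features
    extracted from image patches by a frozen self-supervised model.
    """
    kmeans = KMeans(n_clusters=num_words, n_init=10, random_state=0)
    kmeans.fit(patch_embeddings)
    return kmeans  # cluster centroids act as the visual-word vocabulary


def assign_visual_words(vocab: KMeans, patch_embeddings: np.ndarray) -> np.ndarray:
    """Map each patch embedding to the index of its nearest visual word."""
    return vocab.predict(patch_embeddings)


if __name__ == "__main__":
    # Random placeholder features stand in for real patch embeddings.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(10_000, 128)).astype(np.float32)
    vocab = build_visual_vocabulary(feats, num_words=64)
    words = assign_visual_words(vocab, feats[:16])
    print(words)  # discrete "visual word" ids for the first 16 patches
```

With such a discrete vocabulary in place, NLP-style probing tasks (e.g., predicting relationships between tokens) can in principle be transferred to image representations, which is the general direction the abstract describes.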