While insights into the workings of transformer models have largely emerged from analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.
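To make the notion of input-dependent primitives concrete, the following is a minimal sketch (not the paper's actual pipeline) of a one-level 2D Haar DWT implemented with NumPy. It splits an image into four subbands (an approximation band and three detail bands, the candidate primitives) and verifies that the original image is exactly recoverable from them; the subband naming convention (LL, LH, HL, HH) is an assumption, as conventions vary across libraries.

```python
import numpy as np


def haar_dwt2(x):
    """One-level 2D orthonormal Haar DWT.

    Returns four subbands (LL, LH, HL, HH), each at half the spatial
    resolution of the input. Assumes even height and width.
    """
    s = np.sqrt(2.0)
    # Low/high-pass over column pairs (horizontal direction).
    lo = (x[:, 0::2] + x[:, 1::2]) / s
    hi = (x[:, 0::2] - x[:, 1::2]) / s
    # Low/high-pass over row pairs (vertical direction).
    LL = (lo[0::2, :] + lo[1::2, :]) / s
    HL = (lo[0::2, :] - lo[1::2, :]) / s
    LH = (hi[0::2, :] + hi[1::2, :]) / s
    HH = (hi[0::2, :] - hi[1::2, :]) / s
    return LL, LH, HL, HH


def haar_idwt2(LL, LH, HL, HH):
    """Inverse of haar_dwt2: recompose the image from its subbands."""
    s = np.sqrt(2.0)
    lo = np.empty((LL.shape[0] * 2, LL.shape[1]))
    hi = np.empty_like(lo)
    # Undo the vertical (row-pair) transform.
    lo[0::2, :] = (LL + HL) / s
    lo[1::2, :] = (LL - HL) / s
    hi[0::2, :] = (LH + HH) / s
    hi[1::2, :] = (LH - HH) / s
    # Undo the horizontal (column-pair) transform.
    x = np.empty((lo.shape[0], lo.shape[1] * 2))
    x[:, 0::2] = (lo + hi) / s
    x[:, 1::2] = (lo - hi) / s
    return x
```

In the paper's framing, each subband would be fed through the ViT encoder separately and the resulting representations composed in latent space; the perfect pixel-space reconstruction shown here is what makes the subbands a natural choice of primitives.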