Large transformers are powerful architectures for the self-supervised analysis of data of diverse nature, ranging from protein sequences to text to images. In these models, the hidden-layer representations of the data live in the same space, and the semantic structure of the dataset emerges through a sequence of functionally identical transformations between one representation and the next. Here we characterize the geometric and statistical properties of these representations, focusing on how such properties evolve across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition, we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and on image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second, shallower peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy for identifying the layers most suitable for downstream learning tasks.
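To make the notion of intrinsic dimension concrete, the sketch below implements the TwoNN estimator (Facco et al., 2017), a standard neighbor-based ID estimator of the kind suited to hidden-layer representations; the abstract does not specify which estimator is used, so this is an illustrative choice, and all function and variable names are ours.

```python
# Hedged sketch: TwoNN intrinsic-dimension estimator.
# The ID is inferred from the ratio of each point's distances
# to its first and second nearest neighbors.
import numpy as np

def two_nn_id(X):
    """Maximum-likelihood TwoNN estimate of the intrinsic dimension of X."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances (fine for small N;
    # use a KD-tree for large datasets).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)        # exclude self-distances
    d2.sort(axis=1)
    r1 = np.sqrt(d2[:, 0])              # distance to 1st nearest neighbor
    r2 = np.sqrt(d2[:, 1])              # distance to 2nd nearest neighbor
    mu = r2 / r1
    # MLE: d = N / sum_i log(mu_i)
    return len(X) / np.log(mu).sum()

# Example: points on a 2-D plane embedded in a 10-D ambient space;
# the estimate recovers the manifold dimension, not the ambient one.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(2, 10))
points = rng.uniform(size=(1000, 2)) @ embedding
print(two_nn_id(points))                # close to 2, not 10
```

Applying such an estimator to the activations of each layer in turn yields the ID profile discussed above.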