This paper introduces a general method for exploring equivalence classes in the input space of Transformer models. The approach rests on a mathematical framework that describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. By taking the eigendecomposition of the pullback, through the Jacobian of the model, of the distance metric defined on the output space, we reconstruct equivalence classes in the input space and navigate across them. Our method enables two complementary exploration procedures. The first retrieves input instances that produce the same class probability distribution as the original instance, thereby identifying elements of the same equivalence class. The second discovers instances that yield a different class probability distribution, effectively navigating toward distinct equivalence classes. Finally, we demonstrate how the retrieved instances can be meaningfully interpreted by projecting their embeddings back into a human-readable format.
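The pullback construction can be illustrated with a minimal sketch. It is not the paper's implementation: a toy linear-softmax classifier stands in for the Transformer, the output space is assumed to carry the Euclidean metric, and all names (`model`, `jacobian`, `pullback_eigh`, `W`, `x`) are hypothetical. Under these assumptions the pullback metric is `G = J^T J`. Eigenvectors of `G` with (near-)zero eigenvalue point along directions that leave the class probabilities unchanged, while the eigenvector with the largest eigenvalue points toward a different equivalence class.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def model(x, W):
    # Toy stand-in for a classifier: class probabilities softmax(W x).
    return softmax(W @ x)

def jacobian(x, W):
    # Analytic Jacobian of softmax(W x) with respect to x:
    # d p_i / d z_j = p_i (delta_ij - p_j), chained with z = W x.
    p = model(x, W)
    return (np.diag(p) - np.outer(p, p)) @ W

def pullback_eigh(x, W):
    # Pull back the (assumed Euclidean) output metric through the Jacobian
    # and eigendecompose it. eigh returns eigenvalues in ascending order.
    J = jacobian(x, W)
    G = J.T @ J
    return np.linalg.eigh(G)

# Hypothetical toy weights (3 classes, 4 input dimensions) and input point.
W = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
x = np.array([0.2, -0.1, 0.4, 0.3])

eigvals, eigvecs = pullback_eigh(x, W)
v_same = eigvecs[:, 0]    # smallest eigenvalue: stay in the equivalence class
v_diff = eigvecs[:, -1]   # largest eigenvalue: leave the equivalence class

# Change in the output distribution after a unit step in each direction.
shift_same = np.linalg.norm(model(x + v_same, W) - model(x, W))
shift_diff = np.linalg.norm(model(x + v_diff, W) - model(x, W))
print("eigenvalues:", eigvals)
print("output shift along null direction:", shift_same)
print("output shift along top direction: ", shift_diff)
```

For this toy model a step along a null direction leaves the output unchanged up to floating-point error, because it only adds a constant to every logit. For a nonlinear model such as a Transformer the invariance holds only to first order, so an exploration procedure would presumably need small steps with the eigendecomposition recomputed at each new point.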