Transformers have become the de facto models of choice in machine learning, typically achieving impressive performance across many applications. At the same time, architectural development in the transformer world is mostly driven by empirical findings, and the theoretical understanding of their architectural building blocks is rather limited. In contrast, Dense Associative Memory models, or Modern Hopfield Networks, have a well-established theoretical foundation but have not yet demonstrated truly impressive practical results. We propose a transformer architecture that replaces the sequence of feedforward transformer blocks with a single large Associative Memory model. Our novel architecture, called Energy Transformer (or ET for short), has many of the familiar architectural primitives that are often used in the current generation of transformers. However, it is not identical to the existing architectures. The sequence of transformer layers in ET is purposely designed to minimize a specifically engineered energy function, which is responsible for representing the relationships between the tokens. As a consequence of this computational principle, the attention in ET is different from the conventional attention mechanism. In this work, we introduce the theoretical foundations of ET, explore its empirical capabilities using the image completion task, and obtain strong quantitative results on the graph anomaly detection task.
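To make the central computational principle concrete, the sketch below illustrates, under assumed shapes and simplified energy terms, how token representations can evolve by gradient descent on a single scalar energy, so that the attention update is the gradient of an energy rather than a standard softmax block. The names (Wq, Wk, Xi, energy), the normalization stand-in, and the exact form of the two energy terms are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions, not the authors' code): tokens descend a scalar energy
# made of an attention-like term plus a Hopfield-style memory term.
import jax
import jax.numpy as jnp

N, D, Y, M = 16, 64, 32, 128   # tokens, token dim, head dim, memory dim (assumed sizes)
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)

Wq = jax.random.normal(k1, (D, Y)) * 0.02   # query projection (illustrative)
Wk = jax.random.normal(k2, (D, Y)) * 0.02   # key projection (illustrative)
Xi = jax.random.normal(k3, (M, D)) * 0.02   # stored "memory" patterns (illustrative)
x  = jax.random.normal(k4, (N, D))          # token states

def energy(x):
    """Scalar energy over all tokens: an attention-like term plus a
    Hopfield-style memory term (simplified, not the paper's exact form)."""
    g = x / jnp.linalg.norm(x, axis=-1, keepdims=True)      # stand-in for layer norm
    q, k = g @ Wq, g @ Wk
    e_attn = -jnp.sum(jax.nn.logsumexp(q @ k.T / jnp.sqrt(Y), axis=-1))
    e_mem  = -0.5 * jnp.sum(jax.nn.relu(g @ Xi.T) ** 2)
    return e_attn + e_mem

# The "layers" correspond to repeated gradient-descent steps on the energy.
step = 0.1
for t in range(12):
    x = x - step * jax.grad(energy)(x)

print(float(energy(x)))  # the energy typically decreases over the iterations
```

In this reading, one forward pass through the stacked blocks is replaced by iterating the same energy-descent update, which is what makes the resulting attention differ from the conventional mechanism.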