Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can the inner decision process of neural networks be captured symbolically in some familiar logic? We show that any fixed-precision transformer neural network can be translated into an equivalent fixed-size $\mathsf{FO}(\mathsf{M})$ formula, i.e., a first-order logic formula that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. The proof idea is to design highly uniform boolean threshold circuits that can simulate transformers, and then leverage known theoretical connections between circuits and logic. Our results reveal a surprisingly simple formalism for capturing the behavior of transformers, show that simple problems like integer division are "transformer-hard", and provide valuable insights for comparing transformers to other models like RNNs. Our results suggest that first-order logic with majority may be a useful language for expressing programs extracted from transformers.