Despite a growing body of work at the intersection of deep learning and formal languages, there has been relatively little systematic exploration of transformer models for reasoning about typed lambda calculi. This is an interesting area of inquiry for two reasons. First, typed lambda calculi are the lingua franca of programming languages. A set of heuristics relating various typed lambda calculi to effective neural architectures would provide a systematic method for mapping language features (e.g., polymorphism, subtyping, inheritance) to architectural choices. Second, transformer models are widely used in deep learning architectures applied to code, but their design and hyperparameter space is large and relatively unexplored in programming language applications. We therefore propose a benchmark that lets us explore exactly this through perhaps the simplest and most fundamental property of a programming language: the relationship between terms and types. We begin this inquiry into transformer architectures for typed lambda calculi by studying the effect of warm-up and optimizer selection on the task of type inference: predicting the types of lambda calculus terms using transformers alone. We find that the optimization landscape is difficult even in this simple setting. One notable experimental finding is that Adafactor converges much faster than Adam and RAdam. We conjecture that this difference in optimizer performance may be related to the difficulty of generalizing over formally generated datasets.
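To make the term-to-type prediction task concrete, the following is a minimal sketch in Python of how (term, type) pairs could be generated for the simply typed lambda calculus with annotated binders. The representation and names here are hypothetical illustrations, not the paper's actual benchmark format; a transformer would be trained to map a serialized term to its serialized type.

```python
# Illustrative sketch only: all datatypes and names below are hypothetical,
# not the benchmark's actual serialization.
from dataclasses import dataclass

# --- Types: base type variables and arrow (function) types ---
@dataclass(frozen=True)
class TVar:
    name: str           # e.g. 'a'

@dataclass(frozen=True)
class Arrow:
    src: "Type"         # t1 in t1 -> t2
    dst: "Type"         # t2 in t1 -> t2

Type = TVar | Arrow

# --- Terms: variables, annotated lambdas, applications ---
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Lam:
    param: str          # bound variable
    ann: Type           # binder annotation, as in \x: t. body
    body: "Term"

@dataclass(frozen=True)
class App:
    fn: "Term"
    arg: "Term"

Term = Var | Lam | App

def infer(term: Term, env: dict[str, Type] | None = None) -> Type:
    """Standard type inference for fully annotated STLC terms."""
    env = env or {}
    match term:
        case Var(x):
            return env[x]
        case Lam(x, t, body):
            return Arrow(t, infer(body, {**env, x: t}))
        case App(fn, arg):
            fn_ty = infer(fn, env)
            assert isinstance(fn_ty, Arrow) and fn_ty.src == infer(arg, env)
            return fn_ty.dst

# One (term, type) training pair: the identity \x: a. x has type a -> a.
ident = Lam("x", TVar("a"), Var("x"))
assert infer(ident) == Arrow(TVar("a"), TVar("a"))
```

Under this framing, type inference becomes a sequence-to-sequence problem: the input sequence is a tokenization of the term and the target sequence is a tokenization of its type, which is the setting in which the warm-up and optimizer comparison above applies.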