Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we lack ground-truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into the weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al., 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground-truth transformers that implement programs such as computing token frequencies, sorting, and Dyck-n parenthesis checking. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.
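To illustrate the compiler's interface, here is a minimal usage sketch adapted from the example in the open-source repository: it defines a small RASP program that computes the sequence length at every position, compiles it into transformer weights, and runs the resulting model on a toy input. The helper name `make_length` is illustrative, and the exact output format may vary between library versions.

```python
from tracr.rasp import rasp
from tracr.compiler import compiling


def make_length():
  # Select every (query, key) pair: each position attends to all positions.
  all_true_selector = rasp.Select(
      rasp.tokens, rasp.tokens, rasp.Comparison.TRUE)
  # The width of an all-true selector at each position equals the sequence length.
  return rasp.SelectorWidth(all_true_selector)


# Compile the RASP program into weights for a decoder-only transformer.
model = compiling.compile_rasp_to_model(
    make_length(),
    vocab={1, 2, 3},     # input token vocabulary
    max_seq_len=5,       # maximum supported sequence length
    compiler_bos="BOS",  # beginning-of-sequence marker used by the compiler
)

# Run the compiled transformer on a toy input.
out = model.apply(["BOS", 1, 2, 3])
print(out.decoded)  # expected: ["BOS", 3, 3, 3]
```

Because the underlying program is known exactly, the compiled weights provide a ground-truth target against which interpretability tools can be checked.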