Parsing is a fundamental building block in modern compilers, and for industrial programming languages, it is a surprisingly involved task. There are known approaches to generate parsers automatically, but the prevailing consensus is that automatic parser generation is not practical for real programming languages: LR/LALR parsers are considered to be far too restrictive in the grammars they support, and LR parsers are often considered too inefficient in practice. As a result, virtually all modern languages use recursive-descent parsers written by hand, a lengthy and error-prone process that dramatically increases the barrier to new programming language development. In this work we demonstrate that, contrary to the prevailing consensus, we can have the best of both worlds: for a very general, practical class of grammars -- a strict superset of Knuth's canonical LR -- we can generate parsers automatically, and the resulting parser code, as well as the generation procedure itself, is highly efficient. This advance relies on several new ideas, including novel automata optimization procedures; a new grammar transformation ("CPS"); per-symbol attributes; recursive-descent actions; and an extension of canonical LR parsing, which we refer to as XLR, which endows shift/reduce parsers with the power of bounded nondeterministic choice. With these ingredients, we can automatically generate efficient parsers for virtually all programming languages that are intuitively easy to parse -- a claim we support experimentally, by implementing the new algorithms in a new software tool called langcc, and running them on syntax specifications for Golang 1.17.8 and Python 3.9.12. The tool handles both languages automatically, and the generated code, when run on standard codebases, is 1.2x faster than the corresponding hand-written parser for Golang, and 4.3x faster than the CPython parser, respectively.
翻译:剖析是现代编译器和工业编程语言的基本构件, 令人惊讶地涉及到它。 有已知的自动生成剖析器的方法, 但普遍共识是, 自动剖析器生成对真实编程语言来说并不实用: LR/ LALR 剖析器被认为在它们支持的语法中限制过强, 而 LR 剖析器在实践中常常被认为太低。 因此, 几乎所有现代语言都使用手写的循环- 白白谱读器, 一个漫长和错误易变的过程, 大大增加了新编程语言发展的障碍。 在此工作中, 我们证明, 与普遍一致的共识相反, 我们拥有两个世界的最佳读数: 对于非常普通的、实用的语法类来说, 一个严格的Knuth 标准语法的超级设置, 我们可以自动生成解析器, 并且由此产生的读数代码, 以及生成的程序本身, 都非常高效。 这一进步取决于一些新的想法, 包括新编程的调式的调控理学程序; 一个新的语法, 一个不直接的对等的变的变换(“ CP ) ) 。