flap: A Deterministic Parser with Fused Lexing

Lexers and parsers are typically defined separately and connected by a token stream. This separate definition is important for modularity and reduces the potential for parsing ambiguity. However, materializing tokens as data structures and case-switching on tokens comes with a cost. We show how to fuse separately-defined lexers and parsers, drastically improving performance without compromising modularity or increasing ambiguity. We propose a deterministic variant of Greibach Normal Form that ensures deterministic parsing with a single token of lookahead and makes fusion strikingly simple, and prove that normalizing context-free expressions into the deterministic normal form is semantics-preserving. Our staged parser combinator library, flap, provides a standard interface, but generates specialized token-free code that runs two to six times faster than ocamlyacc on a range of benchmarks.
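To make the cost of a token stream concrete, the following is a minimal hand-written sketch in Python. It is not flap's API, its generated code, or its normal form. It parses the toy grammar `expr ::= INT ('+' INT)*` twice: once with a separate lexer that materializes `(kind, text)` tokens which the parser then branches on, and once with the two phases fused by hand so that the parser branches directly on characters and no token object is ever built. The grammar and the names `lex`, `parse_unfused`, `parse_fused` and `is_digit` are all hypothetical and chosen only for illustration.

```python
from typing import Iterator, Tuple

Token = Tuple[str, str]  # (kind, text): hypothetical token representation


def is_digit(c: str) -> bool:
    # ASCII digits only, so that int() below always succeeds.
    return "0" <= c <= "9"


def lex(src: str) -> Iterator[Token]:
    """Separate lexer: builds one token object per lexeme."""
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif is_digit(c):
            j = i
            while j < n and is_digit(src[j]):
                j += 1
            yield ("INT", src[i:j])
            i = j
        elif c == "+":
            yield ("PLUS", c)
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at {i}")
    yield ("EOF", "")


def parse_unfused(src: str) -> int:
    """Separate parser: switches on the kind of each token (one token of lookahead)."""
    tokens = lex(src)
    kind, text = next(tokens)
    if kind != "INT":
        raise SyntaxError("expected integer")
    total = int(text)
    kind, text = next(tokens)
    while kind == "PLUS":
        kind, text = next(tokens)
        if kind != "INT":
            raise SyntaxError("expected integer after '+'")
        total += int(text)
        kind, text = next(tokens)
    if kind != "EOF":
        raise SyntaxError("trailing input")
    return total


def parse_fused(src: str) -> int:
    """Hand-fused version: same grammar, but branches on characters; no tokens exist."""
    n = len(src)

    def skip_ws(i: int) -> int:
        while i < n and src[i].isspace():
            i += 1
        return i

    def read_int(i: int) -> Tuple[int, int]:
        # Returns the integer value and the index just past it.
        i = skip_ws(i)
        j = i
        while j < n and is_digit(src[j]):
            j += 1
        if j == i:
            raise SyntaxError(f"expected integer at {i}")
        return int(src[i:j]), j

    total, i = read_int(0)
    i = skip_ws(i)
    while i < n and src[i] == "+":
        value, i = read_int(i + 1)
        total += value
        i = skip_ws(i)
    if i != n:
        raise SyntaxError(f"trailing input at {i}")
    return total
```

Both functions accept the same inputs and compute the same sum. The difference is that `parse_fused` was written by hand as a single unit, which gives up the modularity the abstract values. The abstract's claim is that flap obtains token-free code of this general kind automatically, from a lexer and a parser that are still defined separately.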


