flap: A Deterministic Parser with Fused Lexing

Lexers and parsers are typically defined separately and connected by a token stream. This separate definition is important for modularity and reduces the potential for parsing ambiguity. However, materializing tokens as data structures and case-switching on tokens comes with a cost. We show how to fuse separately-defined lexers and parsers, drastically improving performance without compromising modularity or increasing ambiguity. We propose a deterministic variant of Greibach Normal Form that ensures deterministic parsing with a single token of lookahead and makes fusion strikingly simple, and prove that normalizing context free expressions into the deterministic normal form is semantics-preserving. Our staged parser combinator library, flap, provides a standard interface, but generates specialized token-free code that runs two to six times faster than ocamlyacc on a range of benchmarks.
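
To make the cost of materializing tokens concrete, here is a minimal hand-written sketch in Python of the two styles the abstract contrasts, for a toy s-expression grammar. It is not flap's interface or its generated code (flap is an OCaml library that produces the fused code automatically through staging); the grammar and the names `tokenize`, `parse_tokens`, `skip_ws`, and `parse_fused` are assumptions chosen purely for illustration.

```python
# Toy grammar (an assumption for illustration): sexp ::= ATOM | '(' sexp* ')'
# ATOM is a run of letters; whitespace separates tokens.


def tokenize(text):
    """Unfused, step 1: materialize every token as a (kind, value) tuple."""
    tokens, i = [], 0
    while i < len(text):
        c = text[i]
        if c.isspace():
            i += 1
        elif c == "(":
            tokens.append(("LPAR", None))
            i += 1
        elif c == ")":
            tokens.append(("RPAR", None))
            i += 1
        elif c.isalpha():
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(("ATOM", text[i:j]))
            i = j
        else:
            raise SyntaxError(f"unexpected character {c!r} at {i}")
    return tokens


def parse_tokens(tokens, k=0):
    """Unfused, step 2: switch on token kinds. Returns (tree, next index)."""
    if k >= len(tokens):
        raise SyntaxError("unexpected end of input")
    kind, value = tokens[k]
    if kind == "ATOM":
        return value, k + 1
    if kind == "LPAR":
        items, k = [], k + 1
        while k < len(tokens) and tokens[k][0] != "RPAR":
            item, k = parse_tokens(tokens, k)
            items.append(item)
        if k >= len(tokens):
            raise SyntaxError("missing ')'")
        return items, k + 1  # step past the RPAR token
    raise SyntaxError("unexpected ')'")


def skip_ws(text, i):
    """Advance past whitespace and return the new position."""
    while i < len(text) and text[i].isspace():
        i += 1
    return i


def parse_fused(text, i=0):
    """Fused: branch directly on characters; no token tuples are built."""
    i = skip_ws(text, i)
    if i >= len(text):
        raise SyntaxError("unexpected end of input")
    c = text[i]
    if c.isalpha():
        j = i
        while j < len(text) and text[j].isalpha():
            j += 1
        return text[i:j], j
    if c == "(":
        items, i = [], skip_ws(text, i + 1)
        while i < len(text) and text[i] != ")":
            item, i = parse_fused(text, i)
            items.append(item)
            i = skip_ws(text, i)
        if i >= len(text):
            raise SyntaxError("missing ')'")
        return items, i + 1  # step past the ')' character
    raise SyntaxError(f"unexpected character {c!r} at {i}")
```

Both `parse_tokens(tokenize(s))` and `parse_fused(s)` build the same tree for a well-formed input, but the fused version never allocates the intermediate token list or dispatches on token kinds. The drawback of writing it by hand is that lexing and parsing logic are tangled together, which is the modularity loss the abstract says flap avoids by deriving the fused code from separate lexer and parser definitions.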
