We propose a new grammar-based language for defining information extractors over text documents, built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.
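To make the notion of a span-capturing, non-regular extractor concrete, the following minimal sketch enumerates all pairs of adjacent spans (x, y) in a document such that x covers a block a^n and y covers the immediately following block b^n, for n >= 1. Matching the two lengths is the kind of structural constraint a context-free grammar can express and a regular expression cannot. The names `Span` and `balanced_ab_spans` are hypothetical, spans are written as 0-based half-open intervals (an assumption of this sketch), and the brute-force scan only illustrates the output of such a spanner; it is not the evaluation or enumeration algorithm referred to above.

```python
from typing import Dict, Iterator, NamedTuple


class Span(NamedTuple):
    """A document interval; 0-based, half-open [start, end). Hypothetical helper."""
    start: int
    end: int


def balanced_ab_spans(doc: str) -> Iterator[Dict[str, Span]]:
    """Yield each mapping {x, y} where doc[x] = a^n, doc[y] = b^n (n >= 1)
    and y starts exactly where x ends. Illustrative brute force only."""
    for mid in range(1, len(doc)):
        # `mid` is the boundary between x and y; grow both sides in lockstep.
        n = 1
        while (
            mid - n >= 0
            and mid + n <= len(doc)
            and doc[mid - n] == "a"
            and doc[mid + n - 1] == "b"
        ):
            yield {"x": Span(mid - n, mid), "y": Span(mid, mid + n)}
            n += 1


if __name__ == "__main__":
    for mapping in balanced_ab_spans("aabb"):
        print(mapping)
```

Each (boundary, n) combination is visited at most once, so no mapping is emitted twice. The scan does not attempt to meet the constant-delay guarantee.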