When extracting a relation of spans (intervals) from a text document, a common practice is to filter out tuples of the relation that are deemed dominated by others. The domination rule is defined as a partial order that varies along different systems and tasks. For example, we may state that a tuple is dominated by tuples which extend it by assigning additional attributes, or assigning larger intervals. The result of filtering the relation would then be the skyline according to this partial order. As this filtering may remove most of the extracted tuples, we study whether we can improve the performance of the extraction by compiling the domination rule into the extractor. To this aim, we introduce the skyline operator for declarative information extraction tasks expressed as document spanners. We show that this operator can be expressed via regular operations when the domination partial order can itself be expressed as a regular spanner, which covers several natural domination rules. Yet, we show that the skyline operator incurs a computational cost (under combined complexity). First, there are cases where the operator requires an exponential blowup on the number of states needed to represent the spanner as a sequential variable-set automaton. Second, the evaluation may become computationally hard. Our analysis more precisely identifies classes of domination rules for which the combined complexity is tractable or intractable.
翻译:当从文本文档中提取跨度(间隔)的关系时,一种常见的做法是过滤掉被认为被其他元组支配的元组。支配规则在不同的系统和任务中定义为偏序关系。例如,我们可以声明一个元组被比其他元组更大的间隔支配,或者附加其他属性。过滤关系的结果将是按照这个偏序关系的天际线。由于这种过滤可能会去除大多数提取的元组,因此我们研究是否可以通过将支配规则编译到提取器中来改进提取的性能。为此,我们引入了声明信息提取任务的天际线运算符。我们证明,当支配偏序关系本身可以表示为正则跨度器时,该运算符可以通过正则操作表达,这包括几种自然的支配规则。然而,我们发现天际线运算符会产生计算成本(结合复杂性)。首先,在顺序变量集自动机中表示跨度器所需的状态数量可能会出现指数级爆炸。其次,评估可能变得计算复杂。我们的分析更明确地确定了支配规则的类别,其中的结合复杂度是可处理或难处理的。