Many natural language processing (NLP) tasks rely on constituency parsing to recover the syntactic structure of a sentence according to a phrase structure grammar. Many state-of-the-art constituency parsers have been proposed, but they can produce different results for the same sentence, especially on corpora outside their training domains. This paper adopts the idea of truth discovery to aggregate constituency parse trees from different parsers by estimating each parser's reliability in the absence of ground truth. Our goal is to consistently obtain high-quality aggregated constituency parse trees. We formulate the constituency parse tree aggregation problem in two steps: structure aggregation and constituent label aggregation. Specifically, we propose the first truth discovery solution for tree structures, which minimizes the weighted sum of Robinson-Foulds (RF) distances, a classic symmetric distance metric between two trees. We conduct extensive experiments on benchmark datasets in different languages and domains. The experimental results show that our method, CPTAM, outperforms state-of-the-art aggregation baselines. We also demonstrate that the weights estimated by CPTAM can adequately evaluate constituency parsers in the absence of ground truth.
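To make the structure aggregation step more concrete, the following Python sketch illustrates the general idea of minimizing a weighted sum of RF distances. It is not the CPTAM implementation, and several details are assumptions made for illustration: each parse tree is reduced to its set of unlabeled token spans, the RF distance is taken as the size of the symmetric difference of two span sets, the aggregated structure is the weighted majority of spans (spans backed by more than half of the total weight never cross each other, so they form a valid tree), and the weight update is a generic truth discovery rule of the form w = -log(d / sum of d). The names `rf_distance`, `weighted_majority`, `aggregate`, and the smoothing constant `eps` are hypothetical. Constituent label aggregation is not covered.

```python
import math
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open token span [start, end) of one constituent


def rf_distance(a: Set[Span], b: Set[Span]) -> int:
    """RF distance between two trees given as span sets:
    the number of constituents found in exactly one of them."""
    return len(a ^ b)


def weighted_majority(trees: List[Set[Span]], weights: List[float]) -> Set[Span]:
    """Keep every span whose supporting parsers hold more than half of the
    total weight. Including such a span lowers the weighted RF sum, and
    excluding any other span does too, so this set minimizes the sum."""
    total = sum(weights)
    support = {}
    for spans, w in zip(trees, weights):
        for s in spans:
            support[s] = support.get(s, 0.0) + w
    return {s for s, w in support.items() if w > total / 2}


def aggregate(trees: List[Set[Span]], iterations: int = 10, eps: float = 1e-6):
    """Alternate between updating the aggregated structure and the parser
    weights. Assumes at least two input trees over the same sentence."""
    weights = [1.0] * len(trees)
    consensus = weighted_majority(trees, weights)
    for _ in range(iterations):
        dists = [rf_distance(t, consensus) for t in trees]
        total = sum(dists)
        # Assumed update rule: parsers closer to the consensus get larger
        # weights; eps keeps the logarithm finite when a distance is zero.
        weights = [-math.log((d + eps) / (total + eps * len(trees))) for d in dists]
        updated = weighted_majority(trees, weights)
        if updated == consensus:
            break
        consensus = updated
    return consensus, weights


# Three hypothetical parsers on a five-token sentence.
t1 = {(0, 5), (0, 2), (2, 5), (3, 5)}
t2 = {(0, 5), (0, 2), (2, 5), (2, 4)}
t3 = {(0, 5), (1, 5), (2, 5), (3, 5)}
consensus, weights = aggregate([t1, t2, t3])
```

In this toy example the aggregated structure coincides with the first tree, and the first parser ends up with the largest weight, which mirrors the claim that the estimated weights can serve as a proxy for parser quality when no gold trees are available.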