For over thirty years, researchers have developed and analyzed methods for latent tree induction as an approach to unsupervised syntactic parsing. Nonetheless, modern systems still perform too poorly relative to their supervised counterparts to be of practical use for structural annotation of text. In this work, we present a technique that uses distant supervision in the form of span constraints (i.e., phrase bracketing) to improve performance in unsupervised constituency parsing. Using a relatively small number of span constraints, we can substantially improve the output of DIORA, an already competitive unsupervised parsing system. Compared with full parse tree annotation, span constraints can be acquired with minimal effort, for example by using a lexicon derived from Wikipedia to find exact text matches. Our experiments show that span constraints based on entities improve constituency parsing on the English WSJ Penn Treebank by more than 5 F1 points. Furthermore, our method extends to any domain where span constraints are easily attainable, and as a case study we demonstrate its effectiveness by parsing biomedical text from the CRAFT dataset.
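To make the idea of lexicon-derived span constraints concrete, the following is a minimal sketch of how exact text matches against a lexicon could be turned into token spans. It is an illustration only, not the procedure used with DIORA: the function names `find_span_constraints` and `crosses`, the `max_len` parameter, whitespace tokenization, and case-sensitive matching are all assumptions introduced here.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def find_span_constraints(
    tokens: List[str], lexicon: Iterable[str], max_len: int = 5
) -> List[Span]:
    """Return multi-token spans whose text exactly matches a lexicon entry.

    Hypothetical helper: matching is case-sensitive and assumes lexicon
    entries are tokenized by whitespace, like the sentence itself.
    """
    entries: Set[Tuple[str, ...]] = {tuple(e.split()) for e in lexicon}
    spans: List[Span] = []
    n = len(tokens)
    for start in range(n):
        # Only spans of at least two tokens say anything about bracketing.
        for end in range(start + 2, min(n, start + max_len) + 1):
            if tuple(tokens[start:end]) in entries:
                spans.append((start, end))
    return spans


def crosses(a: Span, b: Span) -> bool:
    """True if two spans partially overlap (neither nested nor disjoint)."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


tokens = "The New York Stock Exchange fell sharply".split()
lexicon = ["New York Stock Exchange", "New York"]
constraints = find_span_constraints(tokens, lexicon)
print(constraints)
```

A predicted constituent that `crosses` one of these spans would conflict with the constraint, while nested or disjoint constituents are compatible with it. How such conflicts are penalized during training is not specified in the abstract and is left out of this sketch.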