Naturally-occurring bracketings, such as answer fragments to natural language questions and hyperlinks on webpages, can reflect human syntactic intuition regarding phrasal boundaries. Their availability and approximate correspondence to syntax make them appealing as distant information sources to incorporate into unsupervised constituency parsing. But they are noisy and incomplete; to address this challenge, we develop a partial-brackets-aware structured ramp loss for learning. Experiments demonstrate that our distantly-supervised models trained on naturally-occurring bracketing data induce syntactic structures more accurately than competing unsupervised systems. On the English WSJ corpus, our models achieve an unlabeled F1 score of 68.9 for constituency parsing.
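The abstract does not spell out the loss, but the idea of a structured ramp loss constrained by partial brackets can be sketched concretely. The following is a minimal illustration, not the authors' implementation: it assumes span-factored tree scores with CKY decoding, treats the observed partial brackets as spans that a compatible tree must not cross (a simple relaxation of full compatibility, valid when the observed brackets are mutually non-crossing), and penalizes cost-augmented predictions by +1 per crossing span. The names `cky`, `ramp_loss`, and `crosses` are hypothetical.

```python
def crosses(i, j, brackets):
    """True if span (i, j) crosses any bracket in `brackets`."""
    return any(a < i < b < j or i < a < j < b for (a, b) in brackets)

def cky(n, score, allowed=lambda i, j: True, bonus=lambda i, j: 0.0):
    """Best-scoring binary tree over tokens [0, n).

    `score(i, j)` is the model's score for span (i, j); `allowed` filters
    spans out of the search (constrained/oracle decoding); `bonus` adds a
    per-span term (cost-augmented decoding). Returns (score, list of spans).
    Single-token spans are treated as free (score 0)."""
    NEG = float("-inf")
    chart, back = {}, {}
    for i in range(n):
        chart[(i, i + 1)] = 0.0
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            if not allowed(i, j):
                chart[(i, j)] = NEG
                continue
            best, split = NEG, None
            for k in range(i + 1, j):
                s = chart[(i, k)] + chart[(k, j)]
                if s > best:
                    best, split = s, k
            chart[(i, j)] = best + score(i, j) + bonus(i, j)
            back[(i, j)] = split
    # Read off the spans of the best tree.
    spans, stack = [], [(0, n)]
    while stack:
        i, j = stack.pop()
        if j - i < 2:
            continue
        spans.append((i, j))
        k = back.get((i, j))
        if k is None:
            continue
        stack.extend([(i, k), (k, j)])
    return chart[(0, n)], spans

def ramp_loss(n, score, partial_brackets):
    """Structured ramp loss: hinge between a cost-augmented prediction and
    the best tree that does not cross the observed partial brackets."""
    cost = lambda i, j: 1.0 if crosses(i, j, partial_brackets) else 0.0
    augmented, _ = cky(n, score, bonus=cost)
    oracle, _ = cky(n, score,
                    allowed=lambda i, j: not crosses(i, j, partial_brackets))
    return max(0.0, augmented - oracle)
```

For example, with four tokens, a model that strongly prefers span (1, 3), and an observed bracket (0, 2), the cost-augmented prediction keeps the crossing span (1, 3) while the oracle tree must avoid it, so the loss is positive; with no observed brackets the two decodes coincide and the loss is zero. In actual training, `score` would come from a neural span scorer and the loss gradient would flow through it.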