Although current CCG supertaggers achieve high accuracy on the standard WSJ test set, few systems make use of the categories' internal structure that will drive the syntactic derivation during parsing. The tagset is traditionally truncated, discarding the many rare and complex category types in the long tail. However, supertags are themselves trees. Rather than give up on rare tags, we investigate constructive models that account for this internal structure, including novel methods for tree-structured prediction. Our best tagger is capable of recovering a sizeable fraction of the long-tail supertags and even generates CCG categories that have never been seen in training, while approximating the prior state of the art in overall tag accuracy with fewer parameters. We further investigate how well different approaches generalize to out-of-domain evaluation sets.
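To illustrate the observation that supertags are themselves trees, the following minimal sketch parses a CCG category string such as `(S[dcl]\NP)/NP` into a binary tree of functors and atomic categories. It is not the tagger described above. The names `Atom`, `Functor`, `parse_category`, and `render` are hypothetical, and the sketch assumes the common convention that unparenthesized slashes associate to the left.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    """Atomic category, e.g. "NP" or "S[dcl]" (hypothetical helper type)."""
    label: str


@dataclass(frozen=True)
class Functor:
    """Complex category: result and argument joined by a slash."""
    slash: str  # "/" (forward) or "\\" (backward)
    result: "Category"
    argument: "Category"


Category = Union[Atom, Functor]


def parse_category(text: str) -> Category:
    """Parse a category string into a tree (assumes left-associative slashes)."""
    pos = 0

    def parse_operand() -> Category:
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            inner = parse_complex()
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"missing ')' in {text!r}")
            pos += 1  # consume ")"
            return inner
        start = pos
        while pos < len(text) and text[pos] not in "()/\\":
            pos += 1
        if start == pos:
            raise ValueError(f"expected a category at offset {pos} in {text!r}")
        return Atom(text[start:pos])

    def parse_complex() -> Category:
        nonlocal pos
        left = parse_operand()
        # Each further slash wraps everything parsed so far as the result.
        while pos < len(text) and text[pos] in "/\\":
            slash = text[pos]
            pos += 1
            left = Functor(slash, left, parse_operand())
        return left

    tree = parse_complex()
    if pos != len(text):
        raise ValueError(f"unexpected input at offset {pos} in {text!r}")
    return tree


def render(cat: Category) -> str:
    """Turn a tree back into a string, parenthesizing nested functors."""
    if isinstance(cat, Atom):
        return cat.label

    def wrap(child: Category) -> str:
        inner = render(child)
        return f"({inner})" if isinstance(child, Functor) else inner

    return f"{wrap(cat.result)}{cat.slash}{wrap(cat.argument)}"


def count_atoms(cat: Category) -> int:
    """Number of atomic leaves in the category tree."""
    if isinstance(cat, Atom):
        return 1
    return count_atoms(cat.result) + count_atoms(cat.argument)
```

Under this representation, a rare or previously unseen category is simply a new arrangement of familiar nodes, which is what makes predicting tags piece by piece plausible where a fixed tag inventory cannot produce them.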