During the past decade, neural network models have made tremendous progress on in-domain semantic role labeling (SRL). However, performance drops dramatically in out-of-domain settings. To facilitate research on cross-domain SRL, this paper presents MuCPAD, a multi-domain Chinese predicate-argument dataset consisting of 30,897 sentences and 92,051 predicates from six different domains. MuCPAD exhibits three important features. 1) We adopt a frame-free annotation methodology, which avoids writing complex frames for new predicates. 2) We explicitly annotate omitted core arguments to recover a more complete semantic structure, since omission of content words is ubiquitous in multi-domain Chinese texts. 3) We compile 53 pages of annotation guidelines and adopt strict double annotation to improve data quality. This paper describes the annotation methodology and annotation process of MuCPAD in detail and presents an in-depth data analysis. We also give benchmark results on cross-domain SRL based on MuCPAD.