Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies have usually resorted to a high-quality external lexicon, whose items offer explicit boundary information. However, ensuring the quality of such a lexicon requires great human effort, a cost that has been generally overlooked. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode this information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT to feature induction for Chinese sequence labeling tasks. Experimental results on ten Chinese sequence labeling benchmarks demonstrate that BABERT provides consistent improvements on all datasets. In addition, our method complements previous supervised lexicon exploration: further improvements can be achieved when it is integrated with external lexicon information.