Chinese word segmentation (CWS) models achieve very high performance when training data is sufficient and in-domain. However, performance drops drastically in cross-domain and low-resource scenarios due to data sparseness. Since constructing large-scale manually annotated data is time-consuming and labor-intensive, in this work we propose, for the first time, to mine word boundary information from pauses in speech, efficiently obtaining large-scale naturally annotated data for CWS. We present a simple yet effective complete-then-train method to exploit these natural annotations from speech for CWS model training. Extensive experiments demonstrate that CWS performance in cross-domain and low-resource scenarios can be significantly improved by leveraging our naturally annotated data extracted from speech.