Chinese word segmentation (CWS) is a fundamental step in Chinese natural language processing. In this paper, we present PKUSEG, a new toolkit for multi-domain word segmentation. Unlike existing single-model toolkits, PKUSEG provides separate models for different domains, such as web, medicine, and tourism. Because labeled data is scarce in many domains, we also propose a domain adaptation paradigm that introduces cross-domain semantic knowledge via a translation system. With this method, we generate synthetic data from a large amount of unlabeled target-domain data and then train a word segmentation model for the target domain. We further improve the default model with the help of synthetic data. Experiments show that PKUSEG achieves high performance on multiple domains. The toolkit also supports POS tagging and model training, so it can be adapted to various application scenarios. It is freely and publicly available for research and industrial use.
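To make the idea of separate per-domain models concrete, the following is a minimal sketch and not PKUSEG's actual model or API. It swaps in a simple dictionary-based forward maximum matching segmenter, and the names `DOMAIN_LEXICONS` and `segment` as well as the toy lexicons are hypothetical. It only illustrates why a domain-specific resource can segment domain terms that a general one splits apart.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicons, one per domain (illustration only, not PKUSEG data).
DOMAIN_LEXICONS: Dict[str, Set[str]] = {
    "default": {"我", "爱", "北京", "天安门"},
    "medicine": {"患者", "服用", "阿司匹林", "后", "症状", "缓解"},
}


def segment(text: str, domain: str = "default") -> List[str]:
    """Forward maximum matching against the lexicon of the chosen domain.

    This is a stand-in technique for illustration; unknown domains fall back
    to the "default" lexicon, and unknown characters become single-char words.
    """
    lexicon = DOMAIN_LEXICONS.get(domain, DOMAIN_LEXICONS["default"])
    max_len = max(len(word) for word in lexicon)
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


if __name__ == "__main__":
    print(segment("患者服用阿司匹林后症状缓解", domain="medicine"))
    print(segment("阿司匹林", domain="default"))
```

In this toy setup the medical term 阿司匹林 is kept whole only when the medicine lexicon is selected, and the default lexicon breaks it into single characters. This mirrors, in a much simplified form, the motivation for shipping domain-specific models.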