This paper reviews the development of Chinese word segmentation (CWS) over the most recent decade, 2007-2017. Special attention is paid to the deep learning technologies that have already permeated most areas of natural language processing (NLP). Our basic conclusion is that, compared with traditional supervised learning methods, neural network based methods have not shown superior performance. The most critical challenge still lies in balancing the recognition of in-vocabulary (IV) and out-of-vocabulary (OOV) words. However, as neural models have the potential to capture the essential linguistic structure of natural language, we are optimistic that significant progress may arrive in the near future.