We present fastHan, an open-source toolkit for four basic tasks in Chinese natural language processing: Chinese word segmentation (CWS), Part-of-Speech (POS) tagging, named entity recognition (NER), and dependency parsing. The backbone of fastHan is a multi-task model based on a pruned BERT, which uses only the first 8 layers of BERT. We also provide a 4-layer base model compressed from the 8-layer model. The joint model is trained and evaluated on 13 corpora across the four tasks, yielding near state-of-the-art (SOTA) performance on dependency parsing and NER, and SOTA performance on CWS and POS tagging. Moreover, fastHan transfers well, performing much better than popular segmentation tools on a corpus outside its training data. To better meet the needs of practical applications, we allow users to further fine-tune fastHan with their own labeled data. In addition to its small size and excellent performance, fastHan is user-friendly: implemented as a Python package, it isolates users from the internal technical details and is convenient to use. The project is released on GitHub.