Deep language models have achieved remarkable success in the NLP domain. The standard way to train a deep language model is to employ unsupervised learning from scratch on a large unlabeled corpus. However, such large corpora are only available for widely adopted, high-resource languages and domains. This study presents the first deep language model, DPRK-BERT, for the DPRK language. We achieve this by compiling the first unlabeled corpus for the DPRK language and fine-tuning a preexisting ROK language model. We compare the proposed model with existing approaches and show significant improvements on two DPRK datasets. We also present a cross-lingual version of this model, which yields better generalization across the two Korean languages. Finally, we provide various NLP tools related to the DPRK language to foster future research.
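To make the fine-tuning step concrete, the sketch below shows one way to continue masked-language-model training of an existing ROK BERT checkpoint on an unlabeled DPRK corpus using the HuggingFace transformers library. The checkpoint name and corpus file are illustrative assumptions, not the exact setup used in this work.

```python
# Minimal sketch (not the authors' exact pipeline): continue masked-language-model
# training of a preexisting ROK BERT checkpoint on an unlabeled DPRK corpus.
# The checkpoint name and corpus file below are placeholders for illustration.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

model_name = "snunlp/KR-BERT-char16424"  # assumed ROK checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabeled DPRK sentences, one per line (file name is a placeholder).
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="dprk_corpus.txt", block_size=128
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dprk-bert", num_train_epochs=3),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```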