Pre-trained language models are trained on large-scale unlabeled text and achieve state-of-the-art results on many downstream tasks. However, current pre-trained language models are mainly concentrated on Chinese and English. For low-resource languages such as Tibetan, monolingual pre-trained models are lacking. To promote the development of Tibetan natural language processing, this paper collects large-scale training data from Tibetan websites and uses SentencePiece to construct a vocabulary that covers 99.95$\%$ of the words in the corpus. We then train a Tibetan monolingual pre-trained language model, named TiBERT, on this data and vocabulary. Finally, we apply TiBERT to the downstream tasks of text classification and question generation and compare it with classic models and multilingual pre-trained models; the experimental results show that TiBERT achieves the best performance. Our model is published at http://tibert.cmli-nlp.com/
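As a rough illustration of the vocabulary-construction step, the following is a minimal sketch of training a SentencePiece model on a Tibetan corpus. The file names, vocabulary size, and model type are illustrative assumptions, not the paper's exact settings; only the 99.95\% coverage figure is taken from the abstract.

\begin{verbatim}
# Minimal sketch: building a Tibetan subword vocabulary with SentencePiece.
# Paths, vocab_size, and model_type below are assumptions for illustration.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",     # hypothetical path to the crawled Tibetan text
    model_prefix="tibert_sp",       # hypothetical output prefix (tibert_sp.model / .vocab)
    vocab_size=30000,               # assumed size; the paper's value may differ
    character_coverage=0.9995,      # corresponds to the 99.95% coverage in the abstract
    model_type="unigram",           # assumed; SentencePiece's default algorithm
)

# Load the trained model and tokenize a Tibetan string into subword pieces.
sp = spm.SentencePieceProcessor(model_file="tibert_sp.model")
print(sp.encode("བོད་ཡིག", out_type=str))
\end{verbatim}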