Transformer-based language models, more specifically BERT-based architectures, have achieved state-of-the-art performance in many downstream tasks. However, for a relatively low-resource language such as Thai, the options are limited to training a BERT-based model on a much smaller dataset or finetuning multi-lingual models, both of which yield suboptimal downstream performance. Moreover, large-scale multi-lingual pretraining does not take into account language-specific features of Thai. To overcome these limitations, we pretrain a language model based on the RoBERTa-base architecture on a large, deduplicated, cleaned training set (78GB in total size), curated from diverse domains: social media posts, news articles and other publicly available datasets. We apply text processing rules that are specific to Thai, most importantly preserving spaces, which mark chunk and sentence boundaries in Thai, before subword tokenization. We also experiment with word-level, syllable-level and SentencePiece tokenization on a smaller dataset to explore the effects of tokenization on downstream performance. Our model, wangchanberta-base-att-spm-uncased, trained on the 78.5GB dataset, outperforms strong baselines (NBSVM, CRF and ULMFit) and multi-lingual models (XLMR and mBERT) on both sequence classification and token classification tasks in human-annotated, mono-lingual contexts.
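To make the space-preservation step concrete, the following is a minimal sketch of one way to keep spaces visible to a subword tokenizer: each run of whitespace is replaced with an explicit marker before tokenization and restored afterwards. The marker string `<_>` and the helper names `preserve_spaces` and `restore_spaces` are assumptions introduced for illustration only; the text above does not specify which marker or which exact rules are applied.

```python
import re

# Hypothetical marker token; the actual symbol is not specified in the text above.
SPACE_TOKEN = "<_>"


def preserve_spaces(text: str, space_token: str = SPACE_TOKEN) -> str:
    """Replace each run of whitespace with one explicit marker token.

    Leading and trailing whitespace is stripped first, so markers only
    appear between chunks of text.
    """
    return re.sub(r"\s+", space_token, text.strip())


def restore_spaces(text: str, space_token: str = SPACE_TOKEN) -> str:
    """Inverse step: turn marker tokens back into single spaces."""
    return text.replace(space_token, " ")


if __name__ == "__main__":
    sample = "สวัสดีครับ  วันนี้อากาศดี"
    marked = preserve_spaces(sample)
    print(marked)                  # spaces are now explicit marker tokens
    print(restore_spaces(marked))  # single spaces restored
```

In a real pipeline the marker would presumably be registered with the subword tokenizer as an indivisible symbol so that it is never split; that registration step is omitted here.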