Natural language processing is a fast-growing field of artificial intelligence. Since the Transformer was introduced by Google in 2017, a large number of language models such as BERT, GPT, and ELMo have been inspired by this architecture. These models were trained on huge datasets and achieved state-of-the-art results on natural language understanding tasks. However, fine-tuning a pre-trained language model on much smaller datasets for downstream tasks requires a carefully designed pipeline to mitigate dataset problems such as the lack of training data and class imbalance. In this paper, we propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection. We first tune PhoBERT on our dataset by re-training the model on the masked language modeling task; then, we employ its encoder for text classification. In order to preserve the pre-trained weights while learning new feature representations, we further utilize different training techniques: layer freezing, block-wise learning rates, and label smoothing. Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state of the art on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
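The sketch below illustrates, in a minimal and hypothetical form, the three fine-tuning techniques named above (layer freezing, block-wise learning rates, and label smoothing) applied to a RoBERTa-style encoder such as PhoBERT with the HuggingFace transformers library. It is not the authors' released code; the model checkpoint name, number of labels, learning rates, and smoothing factor are illustrative assumptions.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

# Load a RoBERTa-style encoder with a classification head.
# "vinai/phobert-base" and num_labels=3 are assumed for illustration.
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base", num_labels=3
)

# 1) Layer freezing: keep the embedding layer fixed so its pre-trained
#    weights are preserved while the rest of the encoder adapts.
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False

# 2) Block-wise learning rates: lower encoder blocks receive smaller
#    learning rates than higher blocks and the classification head.
base_lr, decay = 2e-5, 0.9  # assumed values
layers = model.roberta.encoder.layer
param_groups = [
    {"params": layer.parameters(),
     "lr": base_lr * (decay ** (len(layers) - 1 - i))}
    for i, layer in enumerate(layers)
]
param_groups.append({"params": model.classifier.parameters(), "lr": base_lr})
optimizer = AdamW(param_groups, lr=base_lr)

# 3) Label smoothing: soften the one-hot targets in the cross-entropy loss.
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)  # assumed factor

# Inside the training loop (batching and tokenization omitted):
#   logits = model(input_ids, attention_mask=attention_mask).logits
#   loss = loss_fn(logits, labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```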