Pre-trained language models (PLMs) have shown great success in many Natural Language Processing (NLP) tasks, including text classification. Because they require little to no feature engineering, PLMs are becoming the de facto choice for any NLP task. However, for domain-specific corpora (e.g., financial, legal, and industrial), fine-tuning a pre-trained model for a specific task has been shown to improve performance. In this paper, we compare the performance of four different PLMs on three public domain-free datasets and a real-world dataset containing domain-specific words, against a simple linear SVM classifier with TF-IDF vectorized text. The experimental results on the four datasets show that using PLMs, even fine-tuned, does not provide a significant gain over the linear SVM classifier. Hence, for text classification tasks, we recommend a traditional SVM with careful feature engineering, which can be cheaper than PLMs and perform better.
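As a rough illustration of the kind of baseline the abstract refers to, the following is a minimal sketch of a linear SVM over TF-IDF features, assuming scikit-learn is available. The helper name `build_baseline`, the n-gram range, the value of `C`, and the four toy sentences with their labels are all hypothetical choices made for this example; they are not the paper's datasets or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_baseline():
    """Return a TF-IDF + linear SVM pipeline (hypothetical settings)."""
    return make_pipeline(
        # Unigrams and bigrams with sublinear term frequency are common
        # defaults for text; the abstract does not specify these choices.
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        # C=1.0 is the library default, not a tuned value.
        LinearSVC(C=1.0),
    )


# Toy data invented for illustration only.
train_texts = [
    "the central bank raised interest rates",
    "shares fell after the earnings report",
    "the court dismissed the appeal",
    "the judge ruled on the contract dispute",
]
train_labels = ["finance", "finance", "legal", "legal"]

model = build_baseline()
model.fit(train_texts, train_labels)

# Classify an unseen sentence with the fitted pipeline.
prediction = model.predict(["bank interest rates"])[0]
print(prediction)
```

In a real comparison, the same pipeline would be fitted on the training split of each dataset and scored on a held-out split, with the vectorizer and `C` tuned by cross-validation as part of the feature engineering.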