Transformer-based machine learning models have become an essential tool for many natural language processing (NLP) tasks since their introduction. A common objective of these projects is to classify text data. Classification models are often extended to a different topic and/or time period. In these situations, it is difficult to decide how long a classification model remains suitable and when re-training it is worthwhile. This paper compares different approaches to fine-tuning a BERT model for a long-running classification task. We use data from different periods to fine-tune our original BERT model, and we also measure how a second round of annotation can boost classification quality. Our corpus contains over 8 million comments on COVID-19 vaccination in Hungary posted between September 2020 and December 2021. Our results show that the best solution is to use all available unlabeled comments to fine-tune the model. It is not advisable to focus only on comments containing words that the model has not encountered before; a more efficient solution is to randomly sample comments from the new period. Fine-tuning does not prevent the model from losing performance; it merely slows the decline. In a rapidly changing linguistic environment, it is not possible to maintain model performance without regularly annotating new text.
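The fine-tuning on unlabeled comments described above corresponds to continued masked-language-model training of the BERT encoder on new-period text before the classifier is retrained. The following is a minimal sketch of that step, assuming the Hugging Face Transformers and Datasets libraries; the checkpoint name, file path, and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
# Sketch: continued masked-language-model fine-tuning on unlabeled comments
# (assumed Hugging Face Transformers/Datasets; checkpoint and path are placeholders).
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "bert-base-multilingual-cased"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Unlabeled comments from the new period, one comment per line (hypothetical file).
dataset = load_dataset("text", data_files={"train": "new_period_comments.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask tokens so the encoder adapts to new vocabulary and usage.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-bert", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
# The adapted encoder can then be re-loaded with a classification head
# (e.g. AutoModelForSequenceClassification) and fine-tuned on labeled comments.
```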