Transformer-based machine learning models have become an essential tool for many natural language processing (NLP) tasks since their introduction. A common objective of these projects is to classify text data. Classification models are often extended to a different topic and/or time period. In these situations, it is difficult to decide how long a classification model remains suitable and when re-training it is worthwhile. This paper compares different approaches to fine-tuning a BERT model for a long-running classification task. We use data from different periods to fine-tune our original BERT model, and we also measure how a second round of annotation can boost classification quality. Our corpus contains over 8 million comments on COVID-19 vaccination in Hungary posted between September 2020 and December 2021. Our results show that the best solution is to use all available unlabeled comments to fine-tune the model. It is not advisable to focus only on comments containing words that the model has not encountered before; a more efficient solution is to randomly sample comments from the new period. Fine-tuning does not prevent the model from losing performance; it merely slows the decline. In a rapidly changing linguistic environment, it is not possible to maintain model performance without regularly annotating new text.
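The fine-tuning on unlabeled comments described above corresponds to continued masked-language-model training of the BERT encoder on new-period text before the classifier is retrained. The following is a minimal sketch of that step, assuming the Hugging Face Transformers and Datasets libraries; the checkpoint name, file path, and hyperparameters are illustrative placeholders, not the paper's actual configuration.

```python
# Sketch: continued masked-language-model fine-tuning on unlabeled comments
# (assumed Hugging Face Transformers/Datasets; checkpoint and path are placeholders).
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "bert-base-multilingual-cased"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Unlabeled comments from the new period, one comment per line (hypothetical file).
dataset = load_dataset("text", data_files={"train": "new_period_comments.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask tokens so the encoder adapts to new vocabulary and usage.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-bert", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
# The adapted encoder can then be re-loaded with a classification head
# (e.g. AutoModelForSequenceClassification) and fine-tuned on labeled comments.
```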