Traditional text classification approaches often require large amounts of labeled data, which are difficult to obtain, especially in restricted domains or less widespread languages. This scarcity of labeled data has led to the rise of low-resource methods, which assume low data availability in natural language processing. Among them, zero-shot learning stands out: it consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but suffer from two problems: high execution time and an inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo performs better on long inputs and has a shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.

Keywords: Low-Resource NLP, Unlabeled Data, Zero-Shot Learning, Topic Modeling, Transformers.
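To make the pipeline concrete, below is a minimal sketch of the general idea in Python: cluster the unlabeled documents, compress each cluster into a short keyword representation, and run the expensive zero-shot Transformer once per cluster rather than once per long document. This is an illustration under stated assumptions, not the authors' implementation; the toy corpus, the TF-IDF/KMeans choices, and the XNLI checkpoint name are stand-ins for whichever topic-modeling and entailment components one prefers.

```python
# A minimal sketch of the ZeroBERTo idea, not the authors' implementation:
# cluster unlabeled documents, compress each cluster into a short keyword
# string, and run the expensive zero-shot Transformer once per cluster
# instead of once per (possibly long) document. The toy corpus, the
# TF-IDF/KMeans choices, and the XNLI checkpoint are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

docs = [
    "The senate voted on the new election law yesterday.",
    "The president met with party leaders to discuss the bill.",
    "The striker scored twice in the championship final.",
    "The team won the league after a penalty shootout.",
    "Inflation rose as the central bank adjusted interest rates.",
    "Stock markets fell after the quarterly earnings report.",
]
labels = ["politics", "sports", "economy"]  # candidate class names

# 1. Unsupervised clustering over a sparse TF-IDF representation.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
k = len(labels)
cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# 2. Compress each cluster into its top TF-IDF terms (a short "topic" string).
terms = vectorizer.get_feature_names_out()
topics = []
for c in range(k):
    centroid = np.asarray(X[cluster_ids == c].mean(axis=0)).ravel()
    topics.append(" ".join(terms[np.argsort(centroid)[::-1][:10]]))

# 3. Zero-shot classify each compressed topic once, then propagate the
#    predicted label to every document in that cluster.
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")
cluster_label = [classifier(t, candidate_labels=labels)["labels"][0]
                 for t in topics]
doc_labels = [cluster_label[c] for c in cluster_ids]
print(doc_labels)
```

Because the classifier sees only k short topic strings instead of every full document, this kind of pipeline addresses both of the issues noted above: input length and execution time.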