Evaluating Unsupervised Text Classification: Zero-shot and Similarity-based Approaches

Text classification of unseen classes is a challenging Natural Language Processing task that is mainly attempted using two types of approaches. Similarity-based approaches classify instances based on similarities between text document representations and class description representations. Zero-shot text classification approaches aim to generalize knowledge gained from a training task by assigning appropriate labels of unknown classes to text documents. Although existing studies have investigated individual approaches in these categories, the experiments in the literature do not provide a consistent comparison. This paper addresses this gap by conducting a systematic evaluation of different similarity-based and zero-shot approaches for text classification of unseen classes. Different state-of-the-art approaches are benchmarked on four text classification datasets, including a new dataset from the medical domain. Additionally, novel SimCSE and SBERT-based baselines are proposed, as other baselines used in existing work yield weak classification results and are easily outperformed. Finally, the novel similarity-based Lbl2TransformerVec approach is presented, which outperforms previous state-of-the-art approaches in unsupervised text classification. Our experiments show that similarity-based approaches significantly outperform zero-shot approaches in most cases. Additionally, using SimCSE or SBERT embeddings instead of simpler text representations improves similarity-based classification results even further.
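To make the similarity-based idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy bag-of-words `embed` function as a stand-in for an SBERT or SimCSE encoder, and the class descriptions in `CLASSES` are hypothetical examples. A document is assigned the label whose description vector has the highest cosine similarity to the document vector.

```python
import math
from collections import Counter
from typing import Dict


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: stands in for an SBERT/SimCSE encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Empty vectors have no direction; treat their similarity as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(document: str, class_descriptions: Dict[str, str]) -> str:
    """Return the label whose description is most similar to the document."""
    doc_vec = embed(document)
    scores = {
        label: cosine(doc_vec, embed(description))
        for label, description in class_descriptions.items()
    }
    return max(scores, key=scores.get)


# Hypothetical class descriptions (keyword lists), not taken from the paper.
CLASSES = {
    "sports": "football match team player goal league",
    "medicine": "patient treatment disease clinical drug therapy",
}

prediction = classify("the patient received a new drug therapy", CLASSES)
```

No labeled training documents are needed here, only a short description per class. Replacing `embed` with a sentence encoder would keep `classify` unchanged, provided `cosine` is adapted to dense vectors.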
