We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels. Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without any provided ground-truth labels. On five topic classification benchmarks, we improve on various unsupervised baselines by a large margin. On datasets with relatively few and balanced outcome classes, DocSCAN approaches the performance of supervised classification. The method fails for other types of classification, such as sentiment analysis, pointing to important conceptual and practical differences between classifying images and texts.
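The core idea above (mine nearest neighbors in embedding space, then train a clustering head so each document and its neighbors receive consistent soft cluster assignments) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings are random stand-ins for the pre-trained language-model vectors, the neighbor count, cluster count, and entropy weight are illustrative choices, and only the loss computation (not the training loop) is shown.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in document embeddings; in DocSCAN these would come from a
# pre-trained language model rather than a random generator.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))

# Mine the k nearest neighbors of each document by cosine similarity.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T
np.fill_diagonal(sim, -np.inf)           # exclude each point as its own neighbor
neighbors = np.argsort(-sim, axis=1)[:, :5]

def scan_loss(logits, neighbors, entropy_weight=5.0):
    """SCAN-style objective: reward agreement between a document's soft
    cluster assignment and those of its mined neighbors, plus an entropy
    term over mean cluster usage that discourages collapse to one cluster."""
    p = softmax(logits)                  # (N, C) soft cluster assignments
    pn = p[neighbors]                    # (N, k, C) neighbors' assignments
    agree = np.einsum('nc,nkc->nk', p, pn)       # dot products p_i . p_j
    consistency = -np.log(agree + 1e-8).mean()   # want agreement near 1
    mean_p = p.mean(axis=0)                      # cluster usage over batch
    entropy = -(mean_p * np.log(mean_p + 1e-8)).sum()
    return consistency - entropy_weight * entropy

logits = rng.normal(size=(100, 4))       # untrained 4-cluster head
loss = scan_loss(logits, neighbors)
```

Minimizing this loss with gradient descent over the clustering head's parameters (omitted here) pushes neighboring documents into the same cluster while the entropy term keeps all clusters in use, which matches the weak signal described above: no labels are needed, only the assumption that nearest neighbors share a topic.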