Distance-based unsupervised text classification is a method within text classification that leverages the semantic similarity between a label and a text to determine label relevance. This method provides numerous benefits, including fast inference and adaptability to expanding label sets, as opposed to zero-shot, few-shot, and fine-tuned neural networks that require re-training in such cases. In multi-label distance-based classification and information retrieval algorithms, thresholds are required to determine whether a text instance is "similar" to a label or query. Similarity between a text and label is determined in a dense embedding space, usually generated by state-of-the-art sentence encoders. Multi-label classification complicates matters, as a text instance can have multiple true labels, unlike in multi-class or binary classification, where each instance is assigned only one label. We expand upon previous literature on this underexplored topic by thoroughly examining and evaluating the ability of sentence encoders to perform distance-based classification. First, we perform an exploratory study to verify whether the semantic relationships between texts and labels vary across models, datasets, and label sets by conducting experiments on a diverse collection of realistic multi-label text classification (MLTC) datasets. We find that similarity distributions show statistically significant differences across models, datasets and even label sets. We propose a novel method for optimizing label-specific thresholds using a validation set. Our label-specific thresholding method achieves an average improvement of 46% over normalized 0.5 thresholding and outperforms uniform thresholding approaches from previous work by an average of 14%. Additionally, the method demonstrates strong performance even with limited labeled examples.
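To make the thresholding idea concrete, here is a minimal sketch of distance-based multi-label classification with label-specific thresholds. The abstract does not state how the thresholds are optimized, so the per-label grid search that maximizes F1 on the validation set is an assumption, as are all function names (`cosine_similarity_matrix`, `fit_label_thresholds`, `predict`). The toy 2-D vectors stand in for sentence-encoder embeddings and are not data from the paper.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, label_emb):
    """Return an (n_texts, n_labels) matrix of cosine similarities."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    l = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    return t @ l.T


def f1_binary(y_true, y_pred):
    """F1 for a single label; defined as 0.0 when there are no true positives."""
    tp = np.sum(y_true & y_pred)
    if tp == 0:
        return 0.0
    precision = tp / np.sum(y_pred)
    recall = tp / np.sum(y_true)
    return float(2 * precision * recall / (precision + recall))


def fit_label_thresholds(sims, y_true, grid=None):
    """For each label, pick the grid threshold with the best validation F1.

    Assumption: grid search over the cosine range [-1, 1]; ties are broken
    by taking the lowest threshold that reaches the best score.
    """
    if grid is None:
        grid = np.linspace(-1.0, 1.0, 201)
    y_true = np.asarray(y_true, dtype=bool)
    thresholds = np.empty(sims.shape[1])
    for j in range(sims.shape[1]):
        scores = [f1_binary(y_true[:, j], sims[:, j] >= t) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds


def predict(sims, thresholds):
    """Assign every label whose similarity reaches its threshold.

    `thresholds` may be a scalar (uniform) or one value per label.
    """
    return sims >= thresholds


# Toy validation set: 4 texts, 2 labels (hypothetical embeddings).
label_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.2]])
y_val = np.array(
    [[True, False], [False, True], [False, True], [True, True]]
)

sims = cosine_similarity_matrix(text_emb, label_emb)
thresholds = fit_label_thresholds(sims, y_val)
label_specific_pred = predict(sims, thresholds)
uniform_pred = predict(sims, 0.5)
```

In this toy setup the two labels need very different cut-offs, so a single uniform threshold of 0.5 misclassifies some text-label pairs that the label-specific thresholds separate correctly. Adding a label later only requires embedding it and fitting one more threshold; nothing is retrained.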