Extreme classification (XC) involves predicting over large numbers of classes (thousands to millions), with real-world applications such as news article classification and e-commerce product tagging. The zero-shot version of this task requires generalization to novel classes without additional supervision. In this paper, we develop SemSup-XC, a model that achieves state-of-the-art zero-shot and few-shot performance on three XC datasets derived from legal, e-commerce, and Wikipedia data. SemSup-XC represents classes with automatically collected semantic class descriptions and facilitates generalization through a novel hybrid matching module that matches input instances to class descriptions using a combination of semantic and lexical similarity. Trained with contrastive learning, SemSup-XC significantly outperforms baselines on all three datasets, gaining up to 12 precision points on zero-shot and more than 10 precision points on one-shot tests, with similar gains for recall@10. Our ablation studies highlight the relative importance of our hybrid matching module and automatically collected class descriptions.
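To make the idea of matching instances to class descriptions by a combination of semantic and lexical similarity concrete, here is a minimal Python sketch. It is not the SemSup-XC matching module, whose details the abstract does not give. The sketch assumes that the semantic signal is the cosine similarity between precomputed embedding vectors, that the lexical signal is Jaccard overlap between token sets, and that the two are mixed by a fixed weight `alpha`. The function names, the `alpha` parameter, and the data layout are all hypothetical.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def lexical_overlap(text_a, text_b):
    """Jaccard overlap between the token sets of two texts."""
    tokens_a, tokens_b = set(tokenize(text_a)), set(tokenize(text_b))
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def hybrid_score(inst_text, inst_vec, desc_text, desc_vec, alpha=0.5):
    """Assumed mixing rule: alpha * semantic + (1 - alpha) * lexical."""
    semantic = cosine(inst_vec, desc_vec)
    lexical = lexical_overlap(inst_text, desc_text)
    return alpha * semantic + (1.0 - alpha) * lexical


def rank_classes(inst_text, inst_vec, classes, alpha=0.5, k=10):
    """Return the top-k (class_name, score) pairs, highest score first.

    `classes` maps a class name to a (description_text, description_vector)
    pair; this layout is an illustrative assumption.
    """
    scored = [
        (name, hybrid_score(inst_text, inst_vec, desc_text, desc_vec, alpha))
        for name, (desc_text, desc_vec) in classes.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Because a class is represented only by its description text and vector, a class unseen during training can be scored by adding one more entry to `classes`, which is the property the zero-shot setting relies on. In a trained system the vectors would come from learned encoders instead of being supplied by hand.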