Extreme multi-label classification (XMC) is a popular framework for solving many real-world problems that require accurate prediction from a very large number of potential output choices. A common approach to handling the large label space is to arrange the labels into a shallow tree-based index and then learn an ML model to efficiently search this index via beam search. Existing methods initialize the tree index by clustering the label space into a few mutually exclusive clusters based on pre-defined features and keep it fixed throughout the training procedure. This approach results in a sub-optimal indexing structure over the label space and limits the search performance to the quality of choices made during the initialization of the index. In this paper, we propose a novel method, ELIAS, which relaxes the tree-based index to a specialized weighted graph-based index that is learned end-to-end with the final task objective. More specifically, ELIAS models the discrete cluster-to-label assignments in the existing tree-based index as soft learnable parameters that are learned jointly with the rest of the ML model. ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels. In particular, ELIAS can be up to 2.5% better at precision@1 and up to 4% better at recall@100 than existing XMC methods. A PyTorch implementation of ELIAS along with other resources is available at https://github.com/nilesh2797/ELIAS.
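To make the core idea concrete, the following is a minimal PyTorch sketch of how discrete cluster-to-label assignments can be relaxed into soft learnable parameters and combined with cluster scoring and one-vs-all label classifiers. This is an illustrative assumption of the mechanism described above, not the authors' actual implementation (see the linked repository for that); all class, parameter, and dimension names here are hypothetical.

```python
# Hypothetical sketch: a soft, learnable cluster-to-label assignment matrix
# (clusters x labels) trained end-to-end with cluster and label scorers.
import torch
import torch.nn as nn


class SoftLabelIndex(nn.Module):
    def __init__(self, emb_dim: int, num_clusters: int, num_labels: int, beam: int = 4):
        super().__init__()
        self.beam = beam
        # Scores an input embedding against each cluster of the index.
        self.cluster_scorer = nn.Linear(emb_dim, num_clusters)
        # Soft cluster-to-label assignments, learned jointly with the rest of the model
        # (replaces the fixed, mutually exclusive assignments of a tree index).
        self.assign_logits = nn.Parameter(torch.zeros(num_clusters, num_labels))
        # One-vs-all label classifiers.
        self.label_clf = nn.Linear(emb_dim, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, emb_dim) embeddings from any text encoder.
        cluster_probs = torch.sigmoid(self.cluster_scorer(x))          # (batch, C)
        topk_probs, topk_idx = cluster_probs.topk(self.beam, dim=-1)   # beam search over clusters
        # Soft edge weights between the shortlisted clusters and all labels.
        assign = torch.sigmoid(self.assign_logits[topk_idx])           # (batch, beam, L)
        # A label's index score is its best (cluster prob * edge weight) path.
        index_score = (topk_probs.unsqueeze(-1) * assign).amax(dim=1)  # (batch, L)
        label_score = torch.sigmoid(self.label_clf(x))                 # (batch, L)
        return index_score * label_score                               # final relevance scores


# Toy usage: score 4 random inputs against a label space of 1000 labels.
model = SoftLabelIndex(emb_dim=64, num_clusters=32, num_labels=1000)
scores = model(torch.randn(4, 64))
print(scores.shape)  # torch.Size([4, 1000])
```

Because the assignment matrix is differentiable, gradients from the final task loss can reshape which labels attach to which clusters during training, rather than freezing those choices at initialization.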