Many challenging problems in modern applications amount to finding relevant results in an enormous output space of potential candidates, whose size can range from millions to billions. Moreover, training data is often limited for the many items in the so-called "long tail" of the output space. Given this inherent paucity of training data for most items, developing machine-learned models that perform well for output spaces of this size is challenging. Fortunately, items in the output space are often correlated, which presents an opportunity to alleviate the data sparsity issue. In this paper, we propose the Prediction for Enormous and Correlated Output Spaces (PECOS) framework, a versatile and modular machine learning framework for solving prediction problems over very large output spaces, and apply it to the eXtreme Multilabel Ranking (XMR) problem: given an input instance, find and rank the most relevant items from an enormous but fixed and finite output space. PECOS is a three-phase framework: (i) in the first phase, PECOS organizes the output space using a semantic indexing scheme; (ii) in the second phase, PECOS uses the indexing to narrow down the output space by orders of magnitude using a machine-learned matching scheme; and (iii) in the third phase, PECOS ranks the matched items using a final ranking scheme. The versatility and modularity of PECOS allow for easy plug-and-play of various choices for the indexing, matching, and ranking phases. On a dataset where the output space is of size 2.8 million, PECOS with a neural matcher yields a relative increase of roughly 10% in precision@1 (from 46% to 51.2%) over PECOS with a recursive linear matcher, but takes 265x longer to train. We also develop fast real-time inference procedures; for example, inference takes less than 10 milliseconds on the dataset with 2.8 million labels.
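The three phases described above can be illustrated with a minimal toy sketch. This is not the PECOS implementation or its API: the function names (`build_index`, `match`, `rank`, `predict`), the use of plain k-means for indexing, and the use of dot products in place of learned matching and ranking models are all assumptions made purely for illustration.

```python
import numpy as np


def build_index(label_emb, n_clusters, n_iters=10, seed=0):
    """Phase 1 (indexing): group correlated labels with plain k-means.

    Assumption: labels are represented by dense embedding vectors.
    Returns the cluster id of every label and the cluster centroids.
    """
    rng = np.random.default_rng(seed)
    start = rng.choice(len(label_emb), size=n_clusters, replace=False)
    centers = label_emb[start].copy()
    for _ in range(n_iters):
        dists = ((label_emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = label_emb[assign == c]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centers[c] = members.mean(axis=0)
    # Final assignment against the last centroids.
    dists = ((label_emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1), centers


def match(x, centers, beam):
    """Phase 2 (matching): keep only the `beam` best-scoring clusters.

    Assumption: a dot product with the centroid stands in for a learned matcher.
    """
    scores = centers @ x
    return np.argsort(-scores)[:beam]


def rank(x, label_emb, assign, kept_clusters, top_k):
    """Phase 3 (ranking): score only the labels inside the kept clusters."""
    candidates = np.flatnonzero(np.isin(assign, kept_clusters))
    scores = label_emb[candidates] @ x
    order = np.argsort(-scores)[:top_k]
    return candidates[order], scores[order]


def predict(x, label_emb, assign, centers, beam=2, top_k=5):
    """Run matching then ranking for a single input vector `x`."""
    kept = match(x, centers, beam)
    return rank(x, label_emb, assign, kept, top_k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    label_emb = rng.normal(size=(1000, 16))          # toy output space
    label_emb /= np.linalg.norm(label_emb, axis=1, keepdims=True)
    assign, centers = build_index(label_emb, n_clusters=32)
    query = label_emb[42]                            # toy input instance
    labels, scores = predict(query, label_emb, assign, centers, beam=4, top_k=5)
    print(labels, scores)
```

In this sketch the ranker scores only the labels in the `beam` retained clusters rather than the full output space, which is the source of the reduction in work that the matching phase is meant to provide. A small `beam` can miss relevant labels whose cluster was not retained.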