Candidate generation is the first stage in recommendation systems, where a lightweight system retrieves potentially relevant items for a given user. These candidate items are then ranked and pruned in later stages by a more complex ranking model. Because candidate generation sits at the top of the recommendation funnel, it is important to retrieve a high-recall candidate set to feed into downstream ranking models. A common approach to candidate generation is approximate nearest neighbor (ANN) search from a single dense query embedding; however, this approach can yield a low-diversity result set with many near duplicates. As users often have multiple interests, candidate retrieval should ideally return a diverse set of candidates that reflects those interests. To this end, we introduce kNN-Embed, a general approach to improving diversity in dense ANN-based retrieval. kNN-Embed represents each user as a smoothed mixture over learned item clusters that represent distinct "interests" of the user. By querying each of a user's mixture components in proportion to its mixture weight, we retrieve a high-diversity set of candidates that reflects each of the user's interests. We experimentally compare kNN-Embed to standard ANN candidate retrieval and show significant improvements in overall recall and improved diversity across three datasets. Accompanying this work, we open-source a large Twitter follow-graph dataset to spur further research in graph mining and representation learning for recommender systems.
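The idea of querying each mixture component in proportion to its weight can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`smoothed_mixture`, `allocate_budget`, `retrieve_diverse`) are hypothetical, additive smoothing stands in for whatever smoothing kNN-Embed actually uses, the per-cluster query vectors are assumed to be given, and exact brute-force dot-product search replaces a real ANN index.

```python
import numpy as np


def smoothed_mixture(cluster_counts, alpha=1.0):
    """Turn per-cluster interaction counts into mixture weights.

    Assumption: additive (Laplace-style) smoothing with strength `alpha`;
    the abstract does not say which smoothing scheme is used.
    """
    counts = np.asarray(cluster_counts, dtype=float) + alpha
    return counts / counts.sum()


def allocate_budget(weights, k):
    """Split k retrieval slots across clusters in proportion to `weights`.

    Uses the largest-remainder method so the slots always sum to k.
    """
    weights = np.asarray(weights, dtype=float)
    raw = weights / weights.sum() * k
    counts = np.floor(raw).astype(int)
    leftover = k - counts.sum()
    # Hand leftover slots to the clusters with the largest fractional parts.
    order = np.argsort(-(raw - counts), kind="stable")
    counts[order[:leftover]] += 1
    return counts


def retrieve_diverse(cluster_queries, weights, item_vecs, item_cluster, k):
    """Retrieve up to k items, querying every cluster with its own budget.

    cluster_queries: (C, d) hypothetical per-cluster query vectors for one user
    weights:         (C,)   mixture weights over clusters
    item_vecs:       (N, d) item embeddings
    item_cluster:    (N,)   cluster id of each item
    """
    budget = allocate_budget(weights, k)
    results = []
    for c, n in enumerate(budget):
        if n == 0:
            continue
        idx = np.flatnonzero(item_cluster == c)
        # Exact search within the cluster; a real system would use an ANN index.
        scores = item_vecs[idx] @ cluster_queries[c]
        top = idx[np.argsort(-scores, kind="stable")[:n]]
        results.extend(top.tolist())
    # May return fewer than k items if a cluster holds fewer items than its budget.
    return results
```

With two clusters weighted 3:1 and `k=4`, this sketch returns three items from the first cluster and one from the second, whereas a single-query search could return all four from whichever region of the embedding space the one query vector falls in.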