Towards Reliable Item Sampling for Recommendation Evaluation

Since Rendle and Krichene argued that commonly used sampling-based evaluation metrics are "inconsistent" with respect to the global metrics (even in expectation), several studies have examined sampling-based recommender system evaluation. Existing methods either map the sampling-based metrics to their global counterparts or, more generally, learn the empirical rank distribution in order to estimate the top-$K$ metrics. Despite these efforts, the proposed metric estimators still lack a rigorous theoretical understanding, and basic item sampling also suffers from a "blind spot" issue: the estimation error in recovering the top-$K$ metrics can still be substantial when $K$ is small. In this paper, we investigate these problems in depth and make two contributions. First, we propose a new item-sampling estimator that explicitly optimizes the error with respect to the ground truth, and we theoretically highlight its subtle difference from prior work. Second, we propose a new adaptive sampling method that aims to address the "blind spot" problem, and we demonstrate that the expectation-maximization (EM) algorithm can be generalized to this setting. Our experimental results confirm our statistical analysis and the superiority of the proposed methods. This study helps lay the theoretical foundation for adopting item-sampling metrics in recommendation evaluation and provides strong evidence that item sampling can be a powerful and reliable tool for recommendation evaluation.
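To make the mismatch between sampled and global metrics concrete, here is a minimal simulation sketch. It is not the estimator or the adaptive sampler from the paper. It only illustrates the basic item-sampling setup under stated assumptions: each user has one target item, negatives are drawn uniformly without replacement, and the global ranks are made-up uniform random values. The names `sampled_rank` and `recall_at_k` and all the constants are hypothetical.

```python
import random


def sampled_rank(global_rank, num_items, num_negatives, rng):
    """Rank of the target among itself plus uniformly sampled negatives.

    The other num_items - 1 items are indexed 1..num_items-1 in score order,
    so the first global_rank - 1 of them outrank the target.
    """
    negatives = rng.sample(range(1, num_items), num_negatives)
    better = sum(1 for pos in negatives if pos < global_rank)
    return 1 + better


def recall_at_k(ranks, k):
    """Fraction of users whose target item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical setup: 10,000 items, 2,000 users, 99 sampled negatives per user.
rng = random.Random(0)
NUM_ITEMS, NUM_USERS, NUM_NEGATIVES, K = 10_000, 2_000, 99, 10

# Assumption for illustration only: global ranks are uniform over the catalog.
global_ranks = [rng.randint(1, NUM_ITEMS) for _ in range(NUM_USERS)]
sampled_ranks = [
    sampled_rank(r, NUM_ITEMS, NUM_NEGATIVES, rng) for r in global_ranks
]

print("global  Recall@10:", recall_at_k(global_ranks, K))
print("sampled Recall@10:", recall_at_k(sampled_ranks, K))
```

Because a sampled rank can never exceed the corresponding global rank, the sampled Recall@$K$ at the same $K$ is at least as large as the global one, and in this setup it is typically far larger. That gap is why a sampled metric cannot be read directly as its global counterpart and needs a mapping or an estimator of the kind the abstract describes.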
