Since Rendle and Krichene argued that commonly used sampling-based evaluation metrics are ``inconsistent'' with respect to the global metrics (even in expectation), several studies have examined sampling-based recommender system evaluation. Existing methods either map the sampling-based metrics to their global counterparts or, more generally, learn the empirical rank distribution to estimate the top-$K$ metrics. Despite these efforts, a rigorous theoretical understanding of the proposed metric estimators is still lacking, and basic item sampling also suffers from the ``blind spot'' issue, i.e., the error in recovering the top-$K$ metrics can still be substantial when $K$ is small. In this paper, we provide an in-depth investigation into these problems and make two contributions. First, we propose a new item-sampling estimator that explicitly optimizes its error with respect to the ground truth, and we theoretically highlight its subtle difference from prior work. Second, we propose a new adaptive sampling method that addresses the ``blind spot'' problem, and we demonstrate that the expectation-maximization (EM) algorithm can be generalized to this setting. Our experimental results confirm our statistical analysis and the superiority of the proposed methods. This study helps lay the theoretical foundation for adopting item-sampling metrics in recommendation evaluation and provides strong evidence that item sampling can be a powerful and reliable tool for recommendation evaluation.
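To make the gap between sampled and global metrics concrete, the following is a minimal sketch of basic uniform item sampling. It is not the estimator or the adaptive sampling method proposed in this paper. It assumes each user has a single target item, that sampled items are drawn uniformly without replacement from the remaining items, and that the global ranks in the demo are drawn uniformly at random purely for illustration. The names `sampled_rank` and `recall_at_k` are hypothetical helpers introduced here.

```python
import random


def sampled_rank(global_rank, num_items, num_sampled, rng):
    """Rank of the target among itself plus `num_sampled` uniformly sampled other items.

    `global_rank` is the target's 1-based rank among all `num_items` items.
    Assumes num_sampled <= num_items - 1.
    """
    # Index the num_items - 1 non-target items by 1..num_items-1 in rank order.
    # An index below `global_rank` is an item ranked ahead of the target.
    others = rng.sample(range(1, num_items), num_sampled)
    ahead = sum(1 for o in others if o < global_rank)
    return 1 + ahead


def recall_at_k(ranks, k):
    """Fraction of users whose target item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


if __name__ == "__main__":
    rng = random.Random(0)
    num_items, num_sampled, k = 10_000, 99, 10

    # Illustrative assumption: global ranks are uniform over all items.
    global_ranks = [rng.randint(1, num_items) for _ in range(5_000)]
    sampled_ranks = [
        sampled_rank(r, num_items, num_sampled, rng) for r in global_ranks
    ]

    print("global  Recall@%d: %.4f" % (k, recall_at_k(global_ranks, k)))
    print("sampled Recall@%d: %.4f" % (k, recall_at_k(sampled_ranks, k)))
```

Because a sampled rank can never exceed the corresponding global rank, the sampled Recall@$K$ at the same $K$ is always at least as large as the global one, so the sampled value cannot be read directly as an estimate of the global metric. For small $K$, very few users have sampled ranks that carry information about the top of the global ranking, which is one way to picture the ``blind spot'' described above.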