A Case Study on Sampling Strategies for Evaluating Neural Sequential Item Recommendation Models

Currently, sequential item recommendation models are compared by calculating metrics on a small item subset (the target set) to speed up computation. The target set contains the relevant item and a set of negative items sampled from the full item set. Two well-known strategies for sampling negative items are uniform random sampling and sampling by popularity, the latter intended to better approximate the item frequency distribution in the dataset. Most recently published papers on sequential item recommendation rely on sampling by popularity to compare the evaluated models. However, recent work has shown that an evaluation with uniform random sampling may not be consistent with the full ranking, that is, the model ranking obtained by evaluating a metric using the full item set as the target set. This raises the question of whether the ranking obtained with sampling by popularity is equal to the full ranking. In this work, we re-evaluate current state-of-the-art sequential recommender models to determine whether these sampling strategies affect the final ranking of the models. To this end, we train four recently proposed sequential recommendation models on five widely known datasets. For each dataset and model, we employ three evaluation strategies. First, we compute the full model ranking. Then we evaluate all models on a target set sampled by each of the two sampling strategies, uniform random sampling and sampling by popularity, with the commonly used target set size of 100, compute the model ranking for each strategy, and compare the rankings with each other. Additionally, we vary the size of the sampled target set. Overall, we find that both sampling strategies can produce rankings that are inconsistent with the full ranking of the models. Furthermore, sampling by popularity and uniform random sampling do not consistently produce the same ranking ...
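
The sampled evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The function names (`sample_negatives`, `rank_of_relevant`, `hit_rate_at_k`), the choice of Hit Rate@k as the metric, the sequential weighted draw used for popularity sampling, and the optimistic handling of score ties are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random
from collections import Counter


def sample_negatives(item_counts, relevant_item, num_negatives, strategy, rng):
    """Draw distinct negative items, never including the relevant item.

    item_counts maps item -> interaction count (assumed positive).
    strategy is "uniform" or "popularity" (hypothetical labels).
    """
    candidates = [i for i in item_counts if i != relevant_item]
    if num_negatives >= len(candidates):
        return candidates  # degenerates to the full item set
    if strategy == "uniform":
        return rng.sample(candidates, num_negatives)
    if strategy == "popularity":
        # Assumed scheme: weighted sampling without replacement,
        # drawing one item at a time in proportion to its count.
        pool = list(candidates)
        weights = [item_counts[i] for i in pool]
        chosen = []
        for _ in range(num_negatives):
            idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(idx))
            weights.pop(idx)
        return chosen
    raise ValueError(f"unknown strategy: {strategy}")


def rank_of_relevant(scores, relevant_item, target_set):
    """1-based rank of the relevant item within the target set.

    Ties are resolved in favour of the relevant item (an assumption).
    """
    s = scores[relevant_item]
    better = sum(1 for i in target_set if i != relevant_item and scores[i] > s)
    return 1 + better


def hit_rate_at_k(all_scores, relevant_items, target_sets, k):
    """Fraction of test cases whose relevant item ranks in the top k."""
    hits = sum(
        rank_of_relevant(sc, rel, ts) <= k
        for sc, rel, ts in zip(all_scores, relevant_items, target_sets)
    )
    return hits / len(relevant_items)


if __name__ == "__main__":
    rng = random.Random(0)
    counts = Counter({"a": 50, "b": 30, "c": 10, "d": 5, "e": 5})
    scores = {"a": 0.9, "b": 0.95, "c": 0.1, "d": 0.2, "e": 0.3}
    for strategy in ("uniform", "popularity"):
        target = ["a"] + sample_negatives(counts, "a", 2, strategy, rng)
        print(strategy, target, hit_rate_at_k([scores], ["a"], [target], 1))
    print("full", hit_rate_at_k([scores], ["a"], [list(scores)], 1))
```

On the full item set the toy model ranks item `b` above the relevant item `a`, so Hit Rate@1 is 0. A sampled target set that happens to omit `b` yields Hit Rate@1 of 1. This is the kind of discrepancy that can change model rankings between sampled and full evaluation.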
