BARS: Towards Open Benchmarking for Recommender Systems

The past two decades have witnessed the rapid development of personalized recommendation techniques. Despite significant progress in both the research and practice of recommender systems, the field still lacks a widely recognized benchmarking standard. Many existing studies evaluate and compare models in an ad hoc manner, for example by using their own private data splits or different experimental settings. Such practices not only make existing studies harder to reproduce, but also lead to inconsistent experimental results across them, which largely limits the credibility and practical value of research results in this field. To tackle these issues, we present BARS, an initiative for open benchmarking of recommender systems. Compared with some earlier attempts towards this goal, we take a further step by setting up a standardized benchmarking pipeline for reproducible research that integrates all the details about datasets, source code, hyper-parameter settings, running logs, and evaluation results. The benchmark is designed with comprehensiveness and sustainability in mind. It covers both matching and ranking tasks, and enables researchers to easily follow and contribute to research in this field. This project will not only reduce the redundant effort researchers spend re-implementing or re-running existing baselines, but also drive more solid and reproducible research on recommender systems. We call upon everyone to use the BARS benchmark for future evaluation and to contribute to the project through the portal at: https://openbenchmark.github.io/BARS.
