In many applications, such as recommender systems, online advertising, and product search, click-through rate (CTR) prediction is a critical task, because its accuracy has a direct impact on both platform revenue and user experience. In recent years, with the prevalence of deep learning, CTR prediction has been widely studied in both academia and industry, resulting in an abundance of deep CTR models. Unfortunately, the field still lacks a standardized benchmark and uniform evaluation protocols, which leads to non-reproducible and even inconsistent experimental results across these studies. In this paper, we present an open benchmark (namely FuxiCTR) for reproducible research and provide a rigorous comparison of different models for CTR prediction. Specifically, we ran over 4,600 experiments, totaling more than 12,000 GPU hours, in a uniform framework to re-evaluate 24 existing models on two widely used datasets, Criteo and Avazu. Surprisingly, our experiments show that the differences among many models are smaller than expected, and that some results are even inconsistent with those reported in the literature. We believe that our benchmark will not only allow researchers to gauge the effectiveness of new models conveniently, but also establish good practices for fair comparisons with the state of the art. We will release all the code and benchmark settings.
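To illustrate what a uniform evaluation protocol entails, below is a minimal sketch of the standard offline evaluation step for CTR models, assuming held-out click labels and predicted click probabilities are available as arrays. AUC and log loss are the conventional CTR metrics; the variable names and toy values here are hypothetical, not from the FuxiCTR codebase.

```python
# Minimal sketch: scoring a CTR model's held-out predictions with the
# two standard metrics. Toy arrays stand in for real model outputs.
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([1, 0, 0, 1, 0, 1])               # hypothetical binary click labels
y_pred = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.6])   # hypothetical predicted CTRs

auc = roc_auc_score(y_true, y_pred)   # ranking quality: can the model order clicks above non-clicks?
ll = log_loss(y_true, y_pred)         # probabilistic quality: are the predicted CTRs well calibrated?
print(f"AUC: {auc:.4f}, LogLoss: {ll:.4f}")
```

Fixing this scoring code, along with the data splits and preprocessing, across all compared models is precisely what makes results reproducible and comparable across studies.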