One of the challenges in display advertising is that the distribution of features and click-through rate (CTR) can exhibit large shifts over time due to seasonality, changes to ad campaigns, and other factors. The predominant strategy for keeping up with these shifts is to train predictive models continuously, on fresh data, in order to prevent them from becoming stale. However, in many ad systems positive labels are only observed after a possibly long and random delay. These delayed labels pose a challenge to data freshness in continuous training: fresh data may not have complete label information at the time it is ingested by the training algorithm. Naive strategies that treat any data point as a negative example until a positive label becomes available tend to underestimate CTR, resulting in an inferior user experience and suboptimal performance for advertisers. The focus of this paper is to identify the best combination of loss functions and models for large-scale learning from a continuous stream of data in the presence of delayed labels. In this work, we compare 5 different loss functions, 3 of them applied to this problem for the first time. We benchmark their performance in offline settings on both public and proprietary datasets, in conjunction with shallow and deep model architectures. We also discuss the engineering cost associated with implementing each loss function in a production environment. Finally, we carried out online experiments with the top-performing methods in order to validate their performance in a continuous training scheme. When training offline on 668 million in-house data points, our proposed methods outperform the previous state of the art by 3% relative cross entropy (RCE). During online experiments, we observed a 55% gain in revenue per thousand requests (RPMq) against naive log loss.