Click-through rate (CTR) prediction is of great importance in recommendation systems and online advertising platforms. In industrial deployments, the user-generated data observed by a CTR model typically arrives as a stream. The underlying distribution of streaming data drifts over time, and earlier distributions may recur. A model that simply keeps adapting to the newest distribution can therefore suffer catastrophic forgetting, and relearning a distribution that has already occurred is inefficient. Because of memory constraints and the diversity of data distributions in large-scale industrial applications, conventional strategies against catastrophic forgetting, such as replay, parameter isolation, and knowledge distillation, are difficult to deploy. In this work, we design a novel drift-aware incremental learning framework based on ensemble learning to address catastrophic forgetting in CTR prediction. Using explicit error-based drift detection on streaming data, the framework strengthens well-adapted ensembles and freezes ensembles that do not match the input distribution, thereby avoiding catastrophic interference. Both offline experiments and an A/B test show that our method outperforms all baselines considered.
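To make the mechanism described in the abstract more concrete, the following is a minimal sketch of error-based drift handling with an ensemble. It is not the authors' implementation. The names `Member`, `DriftAwareEnsemble`, `drift_threshold`, and `make_batch` are hypothetical, and so are the logistic-regression members, the fixed log-loss threshold, and the fallback of training the least-trained member when no member matches. Each incoming batch is first scored by every member. Members whose error is low enough are treated as well adapted and updated, while the rest are left frozen.

```python
import math


def sigmoid(z):
    # Clamp the input so exp() cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


class Member:
    """One online logistic-regression learner, standing in for a CTR model."""

    def __init__(self, dim, lr=0.5):
        self.w = [0.0] * dim
        self.lr = lr
        self.n_updates = 0  # number of batches this member was trained on

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def log_loss(self, batch):
        eps = 1e-7
        total = 0.0
        for x, y in batch:
            p = min(1 - eps, max(eps, self.predict(x)))
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
        return total / len(batch)

    def update(self, batch):
        # One SGD step per example.
        for x, y in batch:
            g = self.predict(x) - y
            self.w = [wi - self.lr * g * xi for wi, xi in zip(self.w, x)]
        self.n_updates += 1


class DriftAwareEnsemble:
    def __init__(self, dim, n_members=2, drift_threshold=0.6):
        self.members = [Member(dim) for _ in range(n_members)]
        # Assumed value: members with batch log-loss above this are
        # considered mismatched with the current distribution.
        self.drift_threshold = drift_threshold

    def step(self, batch):
        # Score the batch BEFORE training on it (error-based detection).
        losses = [m.log_loss(batch) for m in self.members]
        matched = [i for i, l in enumerate(losses) if l <= self.drift_threshold]
        if not matched:
            # No member fits: treat it as drift to an unseen distribution
            # and train only the least-trained member (assumed fallback).
            matched = [min(range(len(self.members)),
                           key=lambda i: self.members[i].n_updates)]
        for i in matched:
            self.members[i].update(batch)  # strengthen matching members
        return losses, matched  # all other members stay frozen


def make_batch(invert):
    # Toy stream: feature vector is [bias, x]; label is 1 when x > 0,
    # and the rule is inverted to simulate a drifted distribution.
    xs = [-1.0, -0.5, 0.5, 1.0]
    return [([1.0, x], int((x > 0) != invert)) for x in xs]


ens = DriftAwareEnsemble(dim=2)
for _ in range(5):                       # distribution A
    ens.step(make_batch(False))
frozen_w = list(ens.members[0].w)
for _ in range(5):                       # drift to distribution B
    ens.step(make_batch(True))
unchanged = ens.members[0].w == frozen_w
losses, updated = ens.step(make_batch(False))  # distribution A recurs
```

In this toy run, the first member learns distribution A and is left untouched while the second member absorbs distribution B. When A recurs, the first member is matched again without relearning. A production system would need a more robust drift detector than a fixed threshold, and a policy for combining member predictions at serving time.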