In learning-to-rank problems, a privileged feature is one that is available during model training but not at test time. Such features naturally arise in merchandised recommendation systems; for instance, the "user clicked this item" feature is predictive of "user purchased this item" in the offline data, but is clearly not available during online serving. Another source of privileged features is those that are too expensive to compute online but feasible to add offline. Privileged features distillation (PFD) refers to a natural idea: train a "teacher" model using all features (including privileged ones) and then use it to train a "student" model that does not use the privileged features. In this paper, we first study PFD empirically on three public ranking datasets and an industrial-scale ranking problem derived from Amazon's logs. We show that PFD outperforms several baselines (no-distillation, pretraining-finetuning, self-distillation, and generalized distillation) on all these datasets. Next, we analyze why and when PFD performs well via both empirical ablation studies and theoretical analysis for linear models. Both investigations uncover an interesting non-monotone behavior: as the predictive power of a privileged feature increases, the performance of the resulting student model initially increases but then decreases. We show that the reason for the eventual decrease in performance is that a highly predictive privileged teacher produces predictions with high variance, which lead to high-variance student estimates and inferior test performance.
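To make the two-stage recipe concrete, the minimal sketch below fits a teacher on the regular plus privileged features and then fits a student, which sees only the regular features, to a mixture of the hard labels and the teacher's soft predictions. The linear scorers, the pointwise binary-cross-entropy objective, and the mixing weight `alpha` are illustrative assumptions for this sketch, not the paper's exact architectures or losses.

```python
# Hedged sketch of privileged features distillation (PFD) with PyTorch.
# Assumptions (not from the paper): linear scorers, a pointwise BCE
# objective, and a label/soft-target mixing weight `alpha`.
import torch
import torch.nn as nn

def fit(model, X, targets, epochs=200, lr=0.1):
    """Fit a scorer with binary cross-entropy; targets may be soft (in [0, 1])."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), targets)
        loss.backward()
        opt.step()
    return model

def pfd(X_regular, X_privileged, y, alpha=0.5):
    # Stage 1: the teacher sees both regular and privileged features
    # (e.g. clicks), which are available offline but not at serving time.
    X_teacher = torch.cat([X_regular, X_privileged], dim=1)
    teacher = fit(nn.Linear(X_teacher.shape[1], 1), X_teacher, y)

    # Teacher soft targets, e.g. predicted purchase probabilities.
    with torch.no_grad():
        soft = torch.sigmoid(teacher(X_teacher).squeeze(-1))

    # Stage 2: the student sees only the regular features and is trained on
    # a mixture of hard labels and teacher predictions; alpha=0 recovers
    # plain no-distillation training, alpha=1 is pure distillation.
    student_targets = (1 - alpha) * y + alpha * soft
    student = fit(nn.Linear(X_regular.shape[1], 1), X_regular, student_targets)
    return teacher, student
```

At serving time only the student is used, since the privileged columns that the teacher relied on do not exist online.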