Extracting expressive visual features is crucial for accurate Click-Through Rate (CTR) prediction in visual search advertising systems. Current commercial systems use off-the-shelf visual encoders to facilitate fast online service, but the extracted visual features are coarse-grained and/or biased. In this paper, we present a visual encoding framework for CTR prediction that overcomes these problems. The framework is based on contrastive learning, which pulls positive pairs closer and pushes negative pairs apart in the visual feature space. To obtain fine-grained visual features, we present contrastive learning supervised by click-through data to fine-tune the visual encoder. To reduce sample selection bias, we first train the visual encoder offline by leveraging both unbiased self-supervision and click supervision signals. Second, we incorporate a debiasing network into the online CTR predictor to adjust the visual features by contrasting high-impression items with selected lower-impression items. We deploy the framework in the visual sponsored search system at Alibaba. Offline experiments on billion-scale datasets and online experiments demonstrate that the proposed framework makes accurate and unbiased predictions.
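To make the pull-together/push-apart objective concrete, below is a minimal InfoNCE-style contrastive loss sketch in PyTorch. The function name, the temperature value, and the in-batch-negative scheme are illustrative assumptions; the abstract does not specify the exact loss, and the paper's instantiation draws positives from click-through data and contrasts items across impression levels for debiasing.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, temperature=0.1):
    """Illustrative InfoNCE-style contrastive loss (not the paper's exact loss).

    Pulls each anchor embedding toward its matching positive and pushes it
    away from the other samples in the batch, which serve as negatives.
    """
    # Normalize embeddings so dot products become cosine similarities.
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    # Pairwise similarity matrix: entry (i, j) compares anchor i with positive j.
    logits = anchor @ positive.t() / temperature
    # Matching pairs lie on the diagonal; off-diagonal entries act as
    # in-batch negatives that the cross-entropy loss pushes apart.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)

# Hypothetical usage: 128-dim visual embeddings for 32 positive pairs,
# e.g. clicked (query image, item image) pairs under click supervision.
q = torch.randn(32, 128)
k = torch.randn(32, 128)
loss = info_nce_loss(q, k)
```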