Ad relevance modeling plays a critical role in online advertising systems, including Microsoft Bing. To leverage powerful transformers like BERT in this low-latency setting, many existing approaches perform ad-side computations offline. While efficient, these approaches are unable to serve cold-start ads, resulting in poor relevance predictions for such ads. This work aims to design a new, low-latency BERT via structured pruning to empower real-time online inference for cold-start ad relevance on a CPU platform. Our challenge is that previous methods typically prune all layers of the transformer to a high, uniform sparsity, thereby producing models that cannot achieve satisfactory inference speed with acceptable accuracy. In this paper, we propose SwiftPruner, an efficient framework that leverages evolution-based search to automatically find the best-performing layer-wise sparse BERT model under the desired latency constraint. Different from existing evolutionary algorithms that conduct random mutations, we propose a reinforced mutator with a latency-aware multi-objective reward to conduct better mutations for efficiently searching the large space of layer-wise sparse models. Extensive experiments demonstrate that our method consistently achieves higher ROC AUC and lower latency than the uniform sparse baseline and state-of-the-art search methods. Remarkably, under our latency requirement of 1900 μs on CPU, SwiftPruner achieves a 0.86% higher AUC than the state-of-the-art uniform sparse baseline for BERT-Mini on a large-scale real-world dataset. Online A/B testing shows that our model also achieves a significant 11.7% cut in the ratio of defective cold-start ads with satisfactory real-time serving latency.
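The abstract describes an evolution-based search over layer-wise sparsity configurations, guided by a latency-aware multi-objective reward. The sketch below illustrates the general shape of such a search; the reward form, the toy evaluator, and the random mutator are all illustrative assumptions (the paper's actual mutator is reinforced, i.e. learned from the reward signal, and its exact reward function is not given in the abstract).

```python
import random

TARGET_LATENCY_US = 1900.0  # latency budget from the abstract


def reward(auc, latency_us, alpha=1.0):
    """Hypothetical latency-aware multi-objective reward: maximize AUC while
    penalizing configurations that exceed the latency budget."""
    penalty = max(0.0, latency_us / TARGET_LATENCY_US - 1.0)
    return auc - alpha * penalty


def mutate(sparsities, step=0.1):
    """Toy random mutator over per-layer sparsity in [0, 0.95].
    SwiftPruner replaces this with a reinforced, reward-guided mutator."""
    child = list(sparsities)
    i = random.randrange(len(child))
    child[i] = min(0.95, max(0.0, child[i] + random.choice([-step, step])))
    return child


def evolve(evaluate, n_layers=4, population=8, generations=20, seed=0):
    """Minimal evolutionary loop: keep the top half, refill by mutation."""
    random.seed(seed)
    pop = [[round(random.uniform(0.3, 0.9), 2) for _ in range(n_layers)]
           for _ in range(population)]
    best, best_r = None, float("-inf")
    for _ in range(generations):
        scored = sorted(pop, key=evaluate, reverse=True)
        top_r = evaluate(scored[0])
        if top_r > best_r:
            best, best_r = scored[0], top_r
        parents = scored[: population // 2]
        pop = parents + [mutate(random.choice(parents))
                         for _ in range(population - len(parents))]
    return best, best_r


def toy_evaluate(sparsities):
    """Stand-in for real AUC/latency measurement: accuracy drops and latency
    falls as average sparsity rises (entirely synthetic numbers)."""
    mean_s = sum(sparsities) / len(sparsities)
    auc = 0.90 - 0.05 * mean_s
    latency_us = 2600.0 * (1.0 - 0.5 * mean_s)
    return reward(auc, latency_us)
```

In a real pipeline, `toy_evaluate` would be replaced by fine-tuning (or proxy-scoring) each pruned candidate and measuring its CPU latency, which is why an efficient mutator matters: each evaluation is expensive, so the search must reach good layer-wise configurations in few trials.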