Recently, auto-bidding techniques have become an essential tool for increasing the revenue of advertisers. Facing the complex and ever-changing bidding environments of real-world advertising systems (RAS), state-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate real-time bids on behalf of advertisers. Due to safety concerns, it was believed that the RL training process could only be carried out in an offline virtual advertising system (VAS) built from historical data generated in the RAS. In this paper, we argue that significant gaps exist between the VAS and the RAS, causing the RL training process to suffer from the problem of inconsistency between online and offline (IBOO). First, we formally define the IBOO and systematically analyze its causes and influences. Then, to avoid the IBOO, we propose a sustainable online RL (SORL) framework that trains the auto-bidding policy by directly interacting with the RAS instead of learning in the VAS. Specifically, based on our proof of the Lipschitz smoothness of the Q function, we design a safe and efficient online exploration (SER) policy for continuously collecting data from the RAS, and we derive a theoretical lower bound on the safety of the SER policy. We also develop a variance-suppressed conservative Q-learning (V-CQL) method to effectively and stably learn the auto-bidding policy from the collected data. Finally, extensive simulated and real-world experiments validate the superiority of our approach over state-of-the-art auto-bidding algorithms.