Exploration-Exploitation (E\&E) algorithms are commonly adopted to deal with the feedback-loop issue in large-scale online recommender systems. Most existing studies believe that high uncertainty is a good indicator of potential reward, and thus focus primarily on estimating model uncertainty. We argue that this approach overlooks the subsequent effect of exploration on model training. From the perspective of online learning, the adoption of an exploration strategy also affects the collection of training data, which in turn influences model learning. To understand the interaction between exploration and training, we design a Pseudo-Exploration module that simulates the model-updating process after a certain item is explored and the corresponding feedback is received. We further show that this process is equivalent to adding an adversarial perturbation to the model input, and we therefore name our approach Adversarial Gradient Driven Exploration (AGE). For production deployment, we propose a dynamic gating unit that pre-determines the utility of an exploration. This enables us to spend the limited exploration budget effectively and avoid wasting pageview resources on ineffective explorations. The effectiveness of AGE was first examined through extensive ablation studies on an academic dataset. AGE has also been deployed to one of the world's leading display advertising platforms, where we observe significant improvements on various top-line evaluation metrics.
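The core idea described above can be illustrated with a minimal sketch: simulate the feedback a pseudo-exploration would produce, then realize the resulting model update as a gradient-driven perturbation of the input. Everything here is an illustrative assumption — the toy linear-sigmoid scorer, the hypothetical positive feedback `y = 1`, and the step size `epsilon` are stand-ins, not the paper's actual AGE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scoring model: sigmoid(w . x) over an item-embedding input x.
w = rng.normal(size=8)
x = rng.normal(size=8)  # embedding of a candidate item

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(v):
    return sigmoid(w @ v)

# Pseudo-exploration: assume hypothetical positive feedback (y = 1) and
# take the log-loss gradient w.r.t. the *input*, mimicking how an actual
# exploration-plus-update would shift the prediction.
y = 1.0
p = score(x)
grad_x = (p - y) * w  # d(logloss)/dx for the linear-sigmoid toy model

# Adversarial-gradient-driven perturbation: step the input against the
# loss gradient (normalized; epsilon is an illustrative step size).
epsilon = 0.5
x_perturbed = x - epsilon * grad_x / (np.linalg.norm(grad_x) + 1e-8)

# The perturbed score approximates the post-update prediction; its lift
# over the current score can serve as an exploration-utility signal,
# analogous to what a gating unit would threshold on.
utility = score(x_perturbed) - p
```

Because the perturbation moves the input against the loss gradient for the assumed positive feedback, `utility` is positive here: the sketch quantifies how much the prediction would shift if the item were explored and updated on, which is the signal a gating unit could use to skip low-utility explorations.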