Contextual multi-armed bandits (MAB) achieve cutting-edge performance on a variety of problems. When it comes to real-world scenarios such as recommendation systems and online advertising, however, it is essential to consider the resource consumption of exploration. In practice, there is typically a non-zero cost associated with executing a recommendation (arm) in the environment, and hence the policy should be learned under a fixed exploration cost constraint. Learning a globally optimal policy directly is challenging, since the problem is NP-hard and significantly complicates the exploration-exploitation trade-off of bandit algorithms. Existing approaches address the problem with a greedy policy: they estimate the expected reward and cost of each arm from historical observations and greedily select the arm with the highest expected reward/cost ratio until the exploration resource is exhausted. However, such methods are hard to extend to an infinite time horizon, since the learning process terminates once the resource runs out. In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) for policy learning of contextual bandits under a budget constraint. HATCH adopts an adaptive method to allocate the exploration resource based on the remaining resource/time and an estimate of the reward distribution across different user contexts. In addition, it makes full use of the contextual feature information to find the best personalized recommendation. Finally, to provide a theoretical guarantee, we present a regret analysis and prove that HATCH achieves a regret bound as low as $O(\sqrt{T})$. Experimental results demonstrate the effectiveness and efficiency of the proposed method on both synthetic data sets and real-world applications.
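For intuition, the following is a minimal sketch (not the paper's HATCH algorithm) of the budget-constrained greedy baseline described above: expected rewards and costs are estimated from historical pulls, the arm with the highest estimated reward/cost ratio is selected, and learning stops once the budget is exhausted, which is exactly the limitation that prevents extension to an infinite time horizon. All names (`n_arms`, `budget`, the simulated reward and cost models) are assumptions made for this illustration.

```python
import numpy as np

# Illustrative greedy reward/cost-ratio bandit under a budget constraint.
# This is a hypothetical, non-contextual sketch, not the HATCH method.
rng = np.random.default_rng(0)

n_arms = 5
budget = 100.0                                 # total exploration resource B (assumed)
true_reward = rng.uniform(0.2, 0.9, n_arms)    # unknown Bernoulli reward means (simulated)
true_cost = rng.uniform(0.5, 2.0, n_arms)      # unknown positive expected costs (simulated)

reward_sum = np.zeros(n_arms)
cost_sum = np.zeros(n_arms)
pulls = np.zeros(n_arms)

t = 0
while budget > 0:
    t += 1
    if pulls.min() == 0:
        # Pull each arm once to initialise the estimates.
        arm = int(np.argmin(pulls))
    else:
        # Greedy selection on the estimated reward/cost ratio.
        ratio = (reward_sum / pulls) / np.maximum(cost_sum / pulls, 1e-8)
        arm = int(np.argmax(ratio))

    reward = rng.binomial(1, true_reward[arm])  # observe a reward
    cost = rng.exponential(true_cost[arm])      # pay a non-zero execution cost

    reward_sum[arm] += reward
    cost_sum[arm] += cost
    pulls[arm] += 1
    budget -= cost                              # learning terminates when the budget runs out

print(f"rounds played: {t}, total reward: {reward_sum.sum():.1f}")
```

In contrast, HATCH spreads the remaining budget adaptively over time and over user contexts according to their estimated reward distributions, rather than spending greedily until the resource is exhausted.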