The bandit paradigm provides a unified modelling framework for problems that require decision-making under uncertainty. Many business metrics can be viewed as rewards (a.k.a. utilities) that result from actions. As a result, bandit algorithms have attracted large and growing interest from industry, in applications such as search, recommendation and advertising. Indeed, the bandit lens promises direct optimisation of the metrics we care about. Nevertheless, the road to successfully applying bandits in production is not an easy one. Even when the action space and rewards are well-defined, practitioners still need to choose between multi-armed and contextual approaches, on- and off-policy setups, delayed and immediate feedback, myopic and long-term optimisation, and more. To make matters worse, industrial platforms typically give rise to large action spaces in which existing approaches tend to break down. The research literature on these topics is vast, which can overwhelm practitioners whose primary aim is to solve practical problems and who must therefore settle on a specific instantiation or approach for each project. This tutorial will take a step towards filling this gap between the theory and practice of bandits. Our goal is to present a unified overview of the field and its existing terminology, concepts and algorithms, with a focus on problems relevant to industry. We hope our industrial perspective will help future practitioners who wish to leverage the bandit paradigm for their applications.