Introduction to Multi-Armed Bandits

Source: arXiv. Published with Foundations and Trends® in Machine Learning, November 2019. The present version is a revision of the "Foundations and Trends" publication. It contains numerous edits for presentation and accuracy (based in part on readers' feedback), updated and expanded literature reviews, and some new exercises.

Multi-armed bandits are a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of the further developments; many of the chapters conclude with exercises. The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence. The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.
