Online learning algorithms, widely used to power search and content optimization on the web, must balance exploration and exploitation, potentially sacrificing the experience of current users in order to gain information that will lead to better decisions in the future. While necessary in the worst case, explicit exploration has a number of disadvantages compared to the greedy algorithm that always "exploits" by choosing an action that currently looks optimal. We ask under what conditions inherent diversity in the data makes explicit exploration unnecessary. We build on a recent line of work on the smoothed analysis of the greedy algorithm in the linear contextual bandits model. We improve on prior results to show that, whenever the diversity conditions hold, the greedy algorithm almost matches the best possible Bayesian regret rate of any other algorithm on the same problem instance, and that this regret is at most $\tilde O(T^{1/3})$.
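To make the setting concrete, below is a minimal sketch of the greedy algorithm in a linear contextual bandit, under illustrative assumptions not taken from the paper: $K$ arms with unknown per-arm parameter vectors, contexts formed as a base vector plus a small Gaussian perturbation (in the spirit of the smoothed-analysis model), and per-arm ridge-regression estimates. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, T = 5, 3, 2000
theta = rng.normal(size=(K, d))          # unknown per-arm parameters (assumed model)

# Per-arm ridge-regression state: A_a = lambda*I + sum x x^T, b_a = sum r x.
lam = 1.0
A = np.stack([lam * np.eye(d) for _ in range(K)])
b = np.zeros((K, d))

total_reward = 0.0
for t in range(T):
    # "Diverse" contexts: a base context plus a small Gaussian perturbation,
    # loosely mimicking the smoothed-analysis assumption on the data.
    x = rng.normal(size=d) + 0.1 * rng.normal(size=d)

    # Greedy step: estimate each arm's parameters from past data and pick the
    # apparent best arm, with no explicit exploration bonus of any kind.
    theta_hat = np.array([np.linalg.solve(A[a], b[a]) for a in range(K)])
    a_t = int(np.argmax(theta_hat @ x))

    r = theta[a_t] @ x + 0.1 * rng.normal()  # noisy realized reward
    A[a_t] += np.outer(x, x)                 # update only the chosen arm's statistics
    b[a_t] += r * x
    total_reward += r

print(f"average reward after {T} rounds: {total_reward / T:.3f}")
```

The key point the sketch illustrates is that the algorithm never adds an exploration term; any exploration it performs is driven entirely by the perturbation-induced diversity of the contexts.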