We consider the decision-making framework of online convex optimization with a very large number of experts. This setting is ubiquitous in contextual and reinforcement learning problems, where the size of the policy class renders enumeration and search within the policy class infeasible. Instead, we generalize the methodology of online boosting. We define a weak learning algorithm as a mechanism that guarantees multiplicatively approximate regret against a base class of experts. In this access model, we give an efficient boosting algorithm that guarantees near-optimal regret against the convex hull of the base class. We consider both full and partial (a.k.a. bandit) information feedback models. We also give an analogous efficient boosting algorithm for the i.i.d. statistical setting. Our results simultaneously generalize online boosting and gradient boosting guarantees to the contextual learning model, online convex optimization, and bandit linear optimization settings.
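As a rough formal sketch of the weak-learning notion above (the symbols γ, W, and the exact form of the guarantee are our own illustration, not quoted from the paper): a γ-weak learner W for a base class \(\mathcal{H}\) of experts produces predictions \(w_1, \dots, w_T\) such that, for a sequence of linear losses \(f_1, \dots, f_T\),

\[
\sum_{t=1}^{T} f_t(w_t) \;\le\; \gamma \cdot \min_{h \in \mathcal{H}} \sum_{t=1}^{T} f_t(h) \;+\; \mathrm{Regret}_{W}(T), \qquad \gamma \in (0,1],\quad \mathrm{Regret}_{W}(T) = o(T).
\]

Under a guarantee of this shape, boosting aggregates many copies of W so that the combined predictor competes with the convex hull of \(\mathcal{H}\), at the cost of an additive regret term that shrinks as the number of copies grows.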