We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints. The goal of the decision maker is to maximize their total reward, while at the same time incurring a small cumulative constraint violation across the $T$ rounds. We present the first best-of-both-worlds-type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown stochastic model, and in the case in which they are selected at each round by an adversary. Our algorithm is the first to provide guarantees in the adversarial setting with respect to the optimal fixed strategy that satisfies the long-term constraints. In particular, it guarantees a $\rho/(1+\rho)$ fraction of the optimal reward and sublinear regret, where $\rho$ is a feasibility parameter related to the existence of strictly feasible solutions. Our framework employs traditional regret minimizers as black-box components; therefore, by instantiating it with an appropriate choice of regret minimizers, it can handle the full-feedback as well as the bandit-feedback setting. Moreover, it allows the decision maker to seamlessly handle scenarios with non-convex rewards and constraints. We show how our framework can be applied in the context of budget-management mechanisms for repeated auctions in order to guarantee long-term constraints that are not packing (e.g., ROI constraints).
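The abstract does not define $\rho$ precisely; one standard Slater-type formalization (an assumption here, for concreteness) takes $\rho$ to be the largest margin by which some fixed strategy strictly satisfies every constraint,

$$\rho \;=\; \sup_{x \in \mathcal{X}} \; \min_{i \in [m]} \big(-g_i(x)\big),$$

where the constraints are written as $g_i(x) \le 0$. Under this reading, $\rho > 0$ exactly when a strictly feasible solution exists, and the guaranteed fraction $\rho/(1+\rho)$ of the optimal reward grows with the feasibility margin.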
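To make the black-box structure concrete, the sketch below shows one common way such primal-dual frameworks are instantiated: a Lagrangian game in which a primal regret minimizer picks decisions against reward-minus-penalty utilities, while dual multipliers are updated on the observed violations. The class names, the exponential-weights choice, the dual update, and all parameters are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

class ExpWeightsRegretMinimizer:
    """A simple multiplicative-weights regret minimizer over n actions."""

    def __init__(self, n_actions: int, learning_rate: float):
        self.weights = np.ones(n_actions)
        self.eta = learning_rate

    def recommend(self) -> np.ndarray:
        # Play the normalized weights as a probability distribution.
        return self.weights / self.weights.sum()

    def observe(self, utilities: np.ndarray) -> None:
        # Exponential-weights update; renormalize to avoid overflow.
        self.weights *= np.exp(self.eta * utilities)
        self.weights /= self.weights.sum()


def primal_dual_loop(T, n_actions, m, reward_fn, constraint_fn, dual_bound=1.0):
    """Run T rounds of a Lagrangian primal-dual loop (illustrative sketch).

    reward_fn(t)     -> reward vector in [0, 1]^n_actions for round t
    constraint_fn(t) -> (m, n_actions) cost matrix; feasibility means g <= 0
    """
    eta = np.sqrt(np.log(n_actions) / T)
    primal = ExpWeightsRegretMinimizer(n_actions, eta)
    lam = np.zeros(m)  # dual multipliers, one per long-term constraint
    played = []
    for t in range(T):
        x = primal.recommend()
        played.append(x)
        f_t = reward_fn(t)       # observed rewards (full feedback assumed)
        g_t = constraint_fn(t)   # observed constraint costs
        # The primal black box sees the Lagrangified utility f - lam^T g.
        primal.observe(f_t - lam @ g_t)
        # Dual ascent on the observed violation, kept in a bounded set.
        lam = np.clip(lam + eta * (g_t @ x), 0.0, dual_bound)
    return played
```

Swapping the full-feedback exponential-weights box for a bandit regret minimizer (e.g., an EXP3-style update on importance-weighted estimates) is how the same skeleton would cover the bandit-feedback setting mentioned above.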