Small-loss bounds for online learning with partial information

We consider the problem of adversarial (non-stochastic) online learning with partial information feedback, where at each stage a decision maker picks an action from a finite set of possible actions. We develop a black-box approach to solving such problems where the learner observes as feedback only the losses of a subset of the actions that includes the selected action. Specifically, when losses of actions are non-negative, under the graph-based feedback model introduced by Mannor and Shamir, we offer algorithms that attain the so-called "small-loss" regret bounds with high probability. Prior to our work, there was no data-dependent guarantee for general feedback graphs, even for pseudo-regret (without dependence on the number of actions, i.e., taking advantage of the increased information feedback). Addressing this, we provide a high-probability small-loss guarantee. Taking advantage of the black-box nature of our technique, we show applications to obtaining high-probability small-loss guarantees for semi-bandits (including routing in networks) and contextual bandits, including a possibly infinite comparator class (such as infinitely many possible strategies in contextual bandits), as well as learning with slowly changing (shifting) comparators. In the special case of classical bandit and semi-bandit problems, we provide optimal small-loss, high-probability guarantees of $\widetilde{O}(\sqrt{dL^{\star}})$ for the actual regret, where $d$ is the number of arms and $L^{\star}$ is the loss of the best arm or action, answering open questions of Neu. Previous work for bandits and semi-bandits offered analogous regret guarantees only for pseudo-regret and only in expectation.
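To make the graph-based feedback model concrete, here is a minimal Python sketch of the interaction protocol. It is not the algorithm of this paper: it uses a standard exponential-weights learner with an importance-weighted loss estimator, which does not by itself give small-loss or high-probability guarantees. The names `observe_prob` and `run_exp_weights`, the representation of the feedback graph as a list of neighbour lists, and the learning rate `eta` are all assumptions made for illustration.

```python
import numpy as np

def observe_prob(p, graph):
    """Probability that each action's loss is revealed.

    graph[i] lists the actions whose losses are observed when i is played
    (assumed to contain i itself, as in the feedback model above).
    """
    q = np.zeros(len(p))
    for i, neighbours in enumerate(graph):
        for j in neighbours:
            q[j] += p[i]
    return q

def run_exp_weights(losses, graph, eta, seed=0):
    """Play T rounds against a fixed loss matrix of shape (T, d).

    Returns the learner's cumulative loss. Illustrative baseline only.
    """
    rng = np.random.default_rng(seed)
    T, d = losses.shape
    cum_est = np.zeros(d)  # cumulative estimated loss of each action
    total = 0.0
    for t in range(T):
        # Exponential weights; subtracting the minimum avoids underflow.
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        a = rng.choice(d, p=p)
        total += losses[t, a]
        q = observe_prob(p, graph)
        # Only the losses of the neighbours of the played action are seen;
        # dividing by q[j] makes each estimate unbiased.
        for j in graph[a]:
            cum_est[j] += losses[t, j] / q[j]
    return total
```

With `graph = [[0], [1], [2]]` the sketch reduces to the classical bandit setting, and with every action listed as a neighbour of every other it reduces to full information; general feedback graphs sit between these two extremes.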
