We study online learning with bandit feedback across multiple tasks, with the goal of improving average performance across tasks if they are similar according to some natural task-similarity measure. As the first to target the adversarial setting, we design a unified meta-algorithm that yields setting-specific guarantees for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-algorithm tunes the initialization, step-size, and entropy parameter of the Tsallis-entropy generalization of the well-known Exp3 method, with the task-averaged regret provably improving if the entropy of the distribution over estimated optima-in-hindsight is small. For BLO, we learn the initialization, step-size, and boundary-offset of online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with a measure induced by these functions on the interior of the action space. Our adaptive guarantees rely on proving that unregularized follow-the-leader combined with multiplicative weights is enough to online learn a non-smooth and non-convex sequence of affine functions of Bregman divergences that upper-bound the regret of OMD.
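To make the inner-task procedure concrete, below is a minimal Python sketch of one plausible form of the Tsallis-entropy generalization of Exp3 referenced above: online mirror descent on the simplex with a negative-Tsallis-entropy regularizer and importance-weighted loss estimates. The function names, the bisection solver for the simplex constraint, and the per-task loop are illustrative assumptions, not the paper's exact algorithm; in particular, the meta-level procedure that tunes the initialization `w_init`, step-size `eta`, and entropy parameter `beta` across tasks is omitted.

```python
import numpy as np

def tsallis_omd_step(w, loss_est, eta, beta, tol=1e-10, iters=100):
    """One mirror-descent step on the simplex with the negative Tsallis entropy
    psi(w) = (1 - sum_i w_i**beta) / (1 - beta), beta in (0, 1); beta -> 1
    recovers the Shannon-entropy (Exp3-style) update.  The simplex constraint
    is handled by bisecting over the Lagrange multiplier lam."""
    c = beta / (1.0 - beta)
    # dual (mirror) point after the gradient step: grad psi(w) - eta * loss_est
    theta = -c * np.power(w, beta - 1.0) - eta * loss_est

    def candidate(lam):
        # invert grad psi at (theta - lam):
        # w_i = [ (1 - beta) * (lam - theta_i) / beta ]^{1 / (beta - 1)}
        return np.power((1.0 - beta) * (lam - theta) / beta, 1.0 / (beta - 1.0))

    # weights blow up as lam -> max(theta) from above and vanish as lam -> inf,
    # so the total mass is monotone in lam and bisection finds the multiplier.
    lo = theta.max() + 1e-12 * max(1.0, abs(theta.max()))
    hi = lo + 1.0
    while candidate(hi).sum() > 1.0:
        hi += 2.0 * (hi - lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if candidate(mid).sum() > 1.0 else (lo, mid)
        if hi - lo < tol:
            break
    w_next = candidate(0.5 * (lo + hi))
    return w_next / w_next.sum()  # renormalize for numerical safety


def run_task(losses, w_init, eta, beta, rng):
    """Play one adversarial bandit task with tuned (w_init, eta, beta);
    `losses` is a T x K array of per-round, per-arm losses in [0, 1]."""
    w, total = w_init.copy(), 0.0
    for loss_t in losses:
        arm = rng.choice(len(w), p=w)
        total += loss_t[arm]
        est = np.zeros(len(w))
        est[arm] = loss_t[arm] / w[arm]  # importance-weighted loss estimate
        w = tsallis_omd_step(w, est, eta, beta)
    return total
```

Under this sketch, the meta-learner's role as described above is to select `w_init`, `eta`, and `beta` once per task so that the regret averaged over similar tasks improves, e.g. by biasing `w_init` toward arms that were estimated optima-in-hindsight on earlier tasks.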