Vector Optimization with Stochastic Bandit Feedback

We introduce vector optimization problems with stochastic bandit feedback, which extend the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study an $(\epsilon,\delta)$-PAC Pareto set identification problem in which each evaluation of a design yields a noisy observation of its mean reward vector. To characterize the difficulty of learning the Pareto set, we introduce the concept of {\em ordering complexity}, i.e., geometric conditions on the deviations of empirical reward vectors from their means under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We provide gap-dependent and worst-case lower bounds on the sample complexity and show that, in the worst case, the sample complexity scales with the square of the ordering complexity. Furthermore, we investigate the sample complexity of the na\"ive elimination algorithm and prove that it nearly matches the worst-case sample complexity. Finally, we run experiments to verify our theoretical results and illustrate how $C$ and the sampling budget affect the Pareto set, the returned $(\epsilon,\delta)$-PAC Pareto set, and the success of identification.
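To make the role of the ordering cone concrete, the following is a minimal sketch of how a Pareto set could be computed from known mean vectors when $C$ is given in halfspace form, $C = \{x : Wx \ge 0\}$. The function names `in_cone` and `pareto_set`, the matrix name `W`, and the dominance convention used here (design $i$ is dominated when some $\mu_j - \mu_i$ is a nonzero element of $C$) are illustrative assumptions, not definitions taken from the paper, whose exact notion of dominance and of approximate Pareto sets may differ. The sketch also works on exact means, so it ignores the noisy bandit feedback entirely.

```python
def in_cone(W, x, tol=1e-12):
    """Return True if x lies in the polyhedral cone {x : W x >= 0}.

    W is a list of rows (assumed halfspace representation of the cone).
    """
    return all(
        sum(w_k * x_k for w_k, x_k in zip(row, x)) >= -tol
        for row in W
    )


def pareto_set(means, W, tol=1e-12):
    """Indices of designs not dominated under the cone {x : W x >= 0}.

    Assumed convention: design i is dominated if there is a design j
    with means[j] - means[i] a nonzero element of the cone.
    """
    pareto = []
    for i, mu_i in enumerate(means):
        dominated = False
        for j, mu_j in enumerate(means):
            if i == j:
                continue
            diff = [b - a for a, b in zip(mu_i, mu_j)]
            is_nonzero = any(abs(d) > tol for d in diff)
            if is_nonzero and in_cone(W, diff, tol):
                dominated = True
                break
        if not dominated:
            pareto.append(i)
    return pareto


# Hypothetical example: four designs with two objectives each.
means = [[1.0, 0.0], [0.0, 1.0], [0.75, 0.75], [0.25, 0.25]]

# Positive orthant: recovers the usual componentwise Pareto order.
W_orthant = [[1.0, 0.0], [0.0, 1.0]]

# A wider cone that contains the orthant, so more pairs become comparable.
W_wide = [[2.0, 1.0], [1.0, 2.0]]

print(pareto_set(means, W_orthant))
print(pareto_set(means, W_wide))
```

In this toy instance the positive orthant leaves three designs mutually incomparable, while the wider cone makes the balanced design dominate the two extreme ones. This illustrates how the choice of $C$ encodes a decision-maker's preferences and changes the Pareto set.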
