Vector Optimization with Stochastic Bandit Feedback

We introduce vector optimization problems with stochastic bandit feedback, which extend the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of decision-maker preferences to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study the setting where an evaluation of each design yields a noisy observation of its mean reward vector. Under a subgaussian noise assumption, we investigate the sample complexity of the naïve elimination algorithm in an $(\epsilon,\delta)$-PAC setting, where the goal is to identify an $(\epsilon,\delta)$-PAC Pareto set with the minimum number of design evaluations. In particular, we identify cone-dependent geometric conditions on the deviations of empirical reward vectors from their means under which the Pareto front can be approximated accurately. We run experiments to verify our theoretical results and to illustrate how $C$ and the sampling budget affect the Pareto set, the returned $(\epsilon,\delta)$-PAC Pareto set, and the success of identification.
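
To make the role of the ordering cone concrete, the following is a minimal sketch of how a Pareto set could be computed from known mean vectors. It is an illustration under stated assumptions, not the paper's algorithm: it assumes the cone is written as $C = \{x : Wx \ge 0\}$ for a matrix $W$, and that design $j$ dominates design $i$ when $\mu_j - \mu_i$ is a nonzero element of $C$. The names `in_cone`, `dominates`, `pareto_set`, `ORTHANT`, and `WIDE` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def in_cone(W: Sequence[Vector], x: Vector, tol: float = 1e-12) -> bool:
    """Return True if x lies in the polyhedral cone C = {x : W x >= 0} (assumed form)."""
    return all(sum(w * v for w, v in zip(row, x)) >= -tol for row in W)


def dominates(W: Sequence[Vector], mu_j: Vector, mu_i: Vector, tol: float = 1e-12) -> bool:
    """Return True if mu_j - mu_i is a nonzero element of C (assumed dominance rule)."""
    diff = [a - b for a, b in zip(mu_j, mu_i)]
    is_nonzero = any(abs(d) > tol for d in diff)
    return is_nonzero and in_cone(W, diff, tol)


def pareto_set(W: Sequence[Vector], means: Sequence[Vector]) -> List[int]:
    """Indices of designs that no other design dominates with respect to C."""
    return [
        i
        for i, mu_i in enumerate(means)
        if not any(dominates(W, mu_j, mu_i) for j, mu_j in enumerate(means) if j != i)
    ]


# Illustrative mean reward vectors for K = 4 designs with 2 objectives (made-up values).
means = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7), (0.2, 0.2)]

# Positive orthant: recovers the usual componentwise Pareto order.
ORTHANT = [(1.0, 0.0), (0.0, 1.0)]
# A wider cone that contains the orthant, so more pairs of designs become comparable.
WIDE = [(2.0, 1.0), (1.0, 2.0)]

print(pareto_set(ORTHANT, means))
print(pareto_set(WIDE, means))
```

Under these assumptions, widening the cone can only add dominance relations, so the Pareto set for `WIDE` is a subset of the one for `ORTHANT`. In the bandit setting the means are unknown, and the same check would be applied to empirical means with an $\epsilon$-dependent slack, which this sketch omits.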
