We introduce vector optimization problems with stochastic bandit feedback, which extend the best arm identification problem to vector-valued rewards. We consider $K$ designs with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone $C$. This generalizes the concept of the Pareto set in multi-objective optimization and allows different sets of decision-maker preferences to be encoded by $C$. Unlike prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study the setting where an evaluation of each design yields a noisy observation of its mean reward vector. Under a subgaussian noise assumption, we investigate the sample complexity of the na\"ive elimination algorithm in an ($\epsilon,\delta$)-PAC setting, where the goal is to identify an ($\epsilon,\delta$)-PAC Pareto set with the minimum number of design evaluations. To characterize the difficulty of learning the Pareto set, we introduce the concept of ordering complexity, i.e., geometric conditions on the deviations of empirical reward vectors from their means under which the Pareto front can be approximated accurately. We show how to compute the ordering complexity of any polyhedral ordering cone. We run experiments to verify our theoretical results and illustrate how $C$ and the sampling budget affect the Pareto set, the returned ($\epsilon,\delta$)-PAC Pareto set, and the success of identification.
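To make the cone-induced partial order concrete, the following is a minimal sketch, not the algorithm studied in this work. It assumes the polyhedral cone is given in halfspace form $C = \{x : Wx \geq 0\}$ and that design $j$ dominates design $i$ when $\mu_j - \mu_i$ is a nonzero element of $C$. The names `in_cone`, `pareto_set`, `W`, and `tol` are hypothetical and chosen only for illustration, and the mean vectors are treated as known rather than estimated from noisy samples.

```python
def in_cone(x, W, tol=1e-12):
    """Return True if x lies in the cone C = {x : W x >= 0} (assumed halfspace form)."""
    return all(sum(w_k * x_k for w_k, x_k in zip(row, x)) >= -tol for row in W)


def pareto_set(means, W, tol=1e-12):
    """Indices of designs whose mean vector is not dominated under the cone order.

    Assumption: design j dominates design i when mu_j - mu_i is a nonzero
    element of C. The means are treated as exact (no sampling noise).
    """
    pareto = []
    for i, mu_i in enumerate(means):
        dominated = False
        for j, mu_j in enumerate(means):
            if j == i:
                continue
            diff = [a - b for a, b in zip(mu_j, mu_i)]
            nonzero = any(abs(d) > tol for d in diff)
            if nonzero and in_cone(diff, W, tol):
                dominated = True
                break
        if not dominated:
            pareto.append(i)
    return pareto


# Illustrative (made-up) mean reward vectors for K = 4 designs in two dimensions.
means = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6], [0.2, 0.2]]

# Positive orthant: recovers the usual componentwise Pareto order.
W_orthant = [[1.0, 0.0], [0.0, 1.0]]

# A wider cone (hypothetical choice): more pairs become comparable.
W_wide = [[1.0, 0.9], [0.9, 1.0]]

print(pareto_set(means, W_orthant))
print(pareto_set(means, W_wide))
```

With the positive orthant, designs 0, 1, and 2 are mutually incomparable and all remain in the Pareto set. Under the wider cone, design 2 dominates the others and is the only one left. This illustrates how the choice of $C$ changes the Pareto set.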