We study Pareto optimality in the multi-objective multi-armed bandit problem by formulating the adversarial multi-objective multi-armed bandit and properly defining its Pareto regrets, which generalize to stochastic settings as well. Unlike scalarized regrets, these regrets do not rely on any scalarization function and directly reflect Pareto optimality. We also present new algorithms for the settings both with and without prior information about the multi-objective multi-armed bandit instance. Through our upper and lower bounds on the Pareto regrets, the algorithms are shown to be optimal in adversarial settings and nearly optimal in stochastic settings simultaneously. Moreover, the lower-bound analyses show that the new regrets are consistent with the existing Pareto regret for stochastic settings, and they extend an adversarial attack mechanism from single-objective bandits to the multi-objective setting.
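For context, the existing stochastic Pareto regret that the abstract refers to is commonly defined via per-arm Pareto suboptimality gaps, as in Drugan and Nowé (2013); the sketch below is that standard definition rather than this paper's own, and the symbols $\mu_i$, $\Delta_i$, and $a_t$ are illustrative:
\[
\Delta_i \;=\; \inf\bigl\{\varepsilon \ge 0 \;:\; \mu_i + \varepsilon\mathbf{1} \text{ is not Pareto-dominated by any } \mu_j\bigr\},
\qquad
\mathrm{PR}(T) \;=\; \mathbb{E}\Bigl[\sum_{t=1}^{T} \Delta_{a_t}\Bigr],
\]
where $\mu_i \in \mathbb{R}^d$ is the mean reward vector of arm $i$ over $d$ objectives and $a_t$ is the arm played at round $t$. Note that no scalarization of the $d$ objectives appears in this definition; dominance is compared coordinate-wise.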