Improving performance in multi-objective decision-making in Bottles environments with soft maximin approaches

Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalignment between individuals with competing values and from conflicting value systems held by a single human. Starting with the principle of loss aversion, we designed a set of soft maximin function approaches to multi-objective decision-making. Benchmarking these functions in a set of previously developed environments, we found that one new approach in particular, 'split-function exp-log loss aversion' (SFELLA), learns faster than the state-of-the-art thresholded alignment objective method (Vamplew et al., 2021) on three of the four tasks it was tested on, and achieves the same optimal performance after learning. SFELLA also showed relative robustness improvements against changes in objective scale, which may indicate an advantage in dealing with distribution shifts in the environment dynamics. Due to publishing rules, further work could not be presented in the preprint. In the final published version, we will further compare SFELLA to the multi-objective reward exponentials (MORE) approach (Rolf, 2020), demonstrating that SFELLA performs similarly to MORE in a simple, previously described foraging task. In a modified foraging environment with a new resource that was not depleted as the agent worked, SFELLA collected more of the new resource at very little cost in terms of the old resource. Overall, we found SFELLA useful for avoiding problems that sometimes occur with a thresholded approach, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure.
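The abstract names SFELLA but does not state its formula. As a rough illustration of the loss-averse, soft-maximin idea, the sketch below assumes a split transform that is logarithmic for gains and exponential for losses, applied per objective and then summed. The function names `sfella_like` and `aggregate` and the exact functional form are assumptions made for illustration, not definitions taken from the text.

```python
import math


def sfella_like(x: float) -> float:
    """Hypothetical split exp-log transform (assumed form, not from the abstract).

    Gains are compressed logarithmically; losses are amplified exponentially.
    The two branches meet at 0 with matching slope 1.
    """
    if x > 0:
        return math.log1p(x)       # ln(1 + x): diminishing returns on gains
    return -math.expm1(-x)         # 1 - exp(-x): steeply negative for losses


def aggregate(rewards):
    """Sum the transformed per-objective rewards into one scalar utility."""
    return sum(sfella_like(r) for r in rewards)


# Two outcomes with the same plain sum (2.0) across two objectives.
balanced = aggregate([1.0, 1.0])
lopsided = aggregate([3.0, -1.0])
print(balanced > lopsided)
```

Under this assumed form, a large gain on one objective cannot cheaply offset a loss on another, so the aggregate leans toward protecting the worst-off objective without applying a hard minimum or a fixed threshold.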
