We consider the continuum-armed bandit problem in a novel setting: recommending the best arms within a fixed budget when only aggregated feedback is available. This is motivated by applications where precise rewards are impossible or expensive to obtain, while an aggregated reward or feedback, such as the average over a subset of arms, is available. We constrain the set of reward functions by assuming that they are drawn from a Gaussian process, and propose the Gaussian Process Optimistic Optimisation (GPOO) algorithm. GPOO adaptively constructs a tree whose nodes are subsets of the arm space, where the feedback for a node is the aggregated reward of that node's representatives. We propose a new notion of simple regret defined with respect to aggregated feedback on the recommended arms. We provide a theoretical analysis of the proposed algorithm and recover single-point feedback as a special case. We illustrate GPOO and compare it with related algorithms on simulated data.
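To make the notion of aggregated feedback on a tree node concrete, the following is a minimal sketch and not the GPOO algorithm itself. It assumes a one-dimensional arm space, nodes that are intervals, representatives placed at the centres of equal sub-intervals, uniform averaging, and additive Gaussian noise. The names `representatives`, `aggregated_feedback`, and `split` are hypothetical and do not come from the paper.

```python
import numpy as np


def representatives(lo, hi, n):
    """Return n representative arms of the node [lo, hi].

    Assumption: representatives are the centres of n equal sub-intervals.
    """
    width = (hi - lo) / n
    return lo + width * (np.arange(n) + 0.5)


def aggregated_feedback(f, lo, hi, n, noise_std=0.0, rng=None):
    """Average reward of f over the node's representatives, plus optional noise.

    With n = 1 this reduces to single-point feedback at the node centre.
    """
    xs = representatives(lo, hi, n)
    reward = float(np.mean(f(xs)))
    if noise_std > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        reward += float(rng.normal(0.0, noise_std))
    return reward


def split(lo, hi, k=2):
    """Split the node [lo, hi] into k equal child nodes (assumed splitting rule)."""
    edges = np.linspace(lo, hi, k + 1)
    return [(float(a), float(b)) for a, b in zip(edges[:-1], edges[1:])]


if __name__ == "__main__":
    f = lambda x: x ** 2  # toy reward function, for illustration only
    print(aggregated_feedback(f, 0.0, 1.0, n=1))  # single-point feedback
    print(aggregated_feedback(f, 0.0, 1.0, n=2))  # average over two arms
    print(split(0.0, 1.0))                        # children of the root node
```

Setting `n=1` corresponds to the single-point special case mentioned above. A tree-based method would repeatedly call something like `split` on a selected node and query the aggregated feedback of its children until the budget is spent.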