Metalearning Linear Bandits by Prior Update

Fully Bayesian approaches to sequential decision-making assume that problem parameters are generated from a known prior, while in practice, such information is often lacking, and needs to be estimated through learning. This problem is exacerbated in decision-making setups with partial information, where using a misspecified prior may lead to poor exploration and inferior performance. In this work we prove, in the context of stochastic linear bandits and Gaussian priors, that as long as the prior estimate is sufficiently close to the true prior, the performance of an algorithm that uses the misspecified prior is close to that of the algorithm that uses the true prior. Next, we address the task of learning the prior through metalearning, where a learner updates its estimate of the prior across multiple task instances in order to improve performance on future tasks. The estimated prior is then updated within each task based on incoming observations, while actions are selected in order to maximize expected reward. In this work we apply this scheme within a linear bandit setting, and provide algorithms and regret bounds, demonstrating its effectiveness, as compared to an algorithm that knows the correct prior. Our results hold for a broad class of algorithms, including, for example, Thompson Sampling and Information Directed Sampling.
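The two-level scheme described above (a Gaussian prior estimate refined across tasks, and a within-task posterior update that drives action selection) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's algorithm: the function names, the finite action set, the known noise variance, the ridge estimate of each task's parameter, and the naive meta-level rule that averages those estimates to update only the prior mean while the prior covariance guess stays fixed. The within-task step is standard Thompson Sampling with a conjugate Gaussian posterior.

```python
import numpy as np


def gaussian_posterior(mu0, sigma0, X, r, noise_var):
    """Conjugate Gaussian posterior over theta for rewards r = X @ theta + noise."""
    prior_prec = np.linalg.inv(sigma0)
    post_prec = prior_prec + X.T @ X / noise_var
    post_cov = np.linalg.inv(post_prec)
    post_cov = (post_cov + post_cov.T) / 2.0  # remove round-off asymmetry
    post_mean = post_cov @ (prior_prec @ mu0 + X.T @ r / noise_var)
    return post_mean, post_cov


def run_task(theta, actions, mu0, sigma0, horizon, noise_var, rng):
    """Thompson Sampling on one task, started from the (possibly wrong) prior N(mu0, sigma0)."""
    d = actions.shape[1]
    mean_rewards = actions @ theta  # true expected reward of each action
    X = np.empty((0, d))            # actions played so far
    r = np.empty(0)                 # rewards observed so far
    regret = 0.0
    for _ in range(horizon):
        post_mean, post_cov = gaussian_posterior(mu0, sigma0, X, r, noise_var)
        sample = rng.multivariate_normal(post_mean, post_cov)
        k = int(np.argmax(actions @ sample))  # act greedily w.r.t. the posterior sample
        reward = mean_rewards[k] + rng.normal(0.0, np.sqrt(noise_var))
        X = np.vstack([X, actions[k]])
        r = np.append(r, reward)
        regret += mean_rewards.max() - mean_rewards[k]
    # Assumption: a crude ridge estimate of theta is what the meta-level sees.
    theta_hat = np.linalg.solve(X.T @ X + np.eye(d), X.T @ r)
    return regret, theta_hat


def metalearn(num_tasks, horizon, true_mu, true_sigma, actions, noise_var=1.0, seed=0):
    """Across tasks, refine the prior-mean estimate by averaging per-task estimates."""
    rng = np.random.default_rng(seed)
    d = len(true_mu)
    mu_hat, sigma_hat = np.zeros(d), np.eye(d)  # initial prior guess (assumption)
    estimates, regrets = [], []
    for _ in range(num_tasks):
        theta = rng.multivariate_normal(true_mu, true_sigma)  # new task from the true prior
        regret, theta_hat = run_task(theta, actions, mu_hat, sigma_hat,
                                     horizon, noise_var, rng)
        regrets.append(regret)
        estimates.append(theta_hat)
        mu_hat = np.mean(estimates, axis=0)  # naive meta-update of the prior mean only
    return mu_hat, regrets
```

A call such as `metalearn(20, 50, np.array([1.0, -1.0]), 0.1 * np.eye(2), np.eye(2))` returns the final prior-mean estimate and the per-task regret. Comparing those regrets against a run that passes the true prior to `run_task` gives a rough empirical view of the cost of misspecification that the regret bounds quantify.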
