Most bandit policies are designed either to minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters. The former are often too conservative in practical settings, while the latter require assumptions that are hard to verify in practice. We study bandit problems that fall between these two extremes, where the learning agent has access to sampled bandit instances from an unknown prior distribution $\mathcal{P}$ and aims to achieve high reward on average over the bandit instances drawn from $\mathcal{P}$. This setting is of particular importance because it lays the foundations for meta-learning of bandit policies and reflects more realistic assumptions in many practical domains. We propose the use of parameterized bandit policies that are differentiable and can be optimized using policy gradients. This provides a broadly applicable framework that is easy to implement. We derive reward gradients that reflect the structure of bandit problems and policies, for both non-contextual and contextual settings, and propose a number of interesting policies that are differentiable and have low regret. Our algorithmic and theoretical contributions are supported by extensive experiments that show the importance of baseline subtraction and learned biases, and the practicality of our approach on a range of problems.
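To make the idea concrete, the following is a minimal, hypothetical sketch of policy-gradient optimization of a differentiable bandit policy over instances sampled from a prior, including baseline subtraction to reduce gradient variance. It is not the paper's algorithm or API: the Bernoulli bandit prior, the softmax-of-empirical-means policy with a single learnable parameter `theta`, and helper names such as `sample_instance` and `run_episode` are all illustrative assumptions.

```python
# Illustrative sketch only: REINFORCE-style optimization of a softmax bandit
# policy on instances drawn from an assumed Bernoulli prior. All names and
# parameterizations here are hypothetical, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
K, horizon = 5, 200          # number of arms, episode length

def sample_instance():
    """Draw a Bernoulli bandit instance (mean reward per arm) from a prior."""
    return rng.uniform(0.25, 0.75, size=K)

def run_episode(theta, means):
    """Run one episode with a softmax-of-empirical-means policy.

    Returns the episode return and the accumulated score function
    (sum of d/dtheta log pi(a_t | history)), giving a REINFORCE estimate.
    """
    counts, sums = np.ones(K), np.zeros(K)
    ret, score = 0.0, 0.0
    for _ in range(horizon):
        mu_hat = sums / counts
        logits = theta * mu_hat
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(K, p=p)
        # d/dtheta log p(a) = mu_hat[a] - sum_i p[i] * mu_hat[i]
        score += mu_hat[a] - p @ mu_hat
        r = float(rng.random() < means[a])
        counts[a] += 1.0
        sums[a] += r
        ret += r
    return ret, score

theta, lr, batch = 1.0, 0.05, 32
for step in range(200):
    episodes = [run_episode(theta, sample_instance()) for _ in range(batch)]
    returns = np.array([e[0] for e in episodes])
    scores = np.array([e[1] for e in episodes])
    baseline = returns.mean()                 # baseline subtraction reduces variance
    grad = np.mean((returns - baseline) * scores)
    theta += lr * grad                        # gradient ascent on expected return
```

In this sketch the policy remains a standard bandit strategy at deployment time; only the scalar `theta` is meta-learned across sampled instances, which mirrors the abstract's point that the optimized policies stay simple, differentiable, and broadly applicable.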