Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization

Offline policy learning (OPL) uses data collected a priori for policy optimization, without any active exploration. Despite the prevalence of and recent interest in this problem, its theoretical and algorithmic foundations in function approximation settings remain underdeveloped. In this paper, we study this problem along the axes of distributional shift, optimization, and generalization in offline contextual bandits with neural networks. In particular, we propose a provably efficient offline contextual bandit algorithm with neural network function approximation that does not require any functional assumption on the reward. We show that our method provably generalizes over unseen contexts under a milder distributional shift condition than existing OPL work. Notably, unlike any other OPL method, our method learns from the offline data in an online manner using stochastic gradient descent, which brings the benefits of online learning to the offline setting. Moreover, we show that our method is more computationally efficient and has a better dependence on the effective dimension of the neural network than its online counterpart. Finally, we demonstrate the empirical effectiveness of our method on a range of synthetic and real-world OPL problems.

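To make the idea of learning from logged data "in an online manner" with pessimism more concrete, here is a minimal sketch. It is not the paper's algorithm. It assumes a one-hidden-layer ReLU network, a block encoding of context-action pairs, a gradient-based confidence width, and one SGD step per logged sample. The abstract specifies none of these choices, and all function names and hyperparameters (`fit_offline`, `lcb_scores`, `beta`, `lam`, `lr`) are hypothetical.

```python
import numpy as np

def encode(x, a, n_actions):
    # Assumed encoding: place the context in the block belonging to action a.
    z = np.zeros(x.shape[0] * n_actions)
    z[a * x.shape[0]:(a + 1) * x.shape[0]] = x
    return z

def predict_and_grad(params, z):
    # f(z) = w2 . relu(W1 z) / sqrt(m), plus the flattened parameter gradient.
    W1, w2 = params
    m = w2.shape[0]
    pre = W1 @ z
    h = np.maximum(pre, 0.0)
    value = w2 @ h / np.sqrt(m)
    gW1 = np.outer(w2 * (pre > 0), z) / np.sqrt(m)
    gw2 = h / np.sqrt(m)
    return value, gW1, gw2

def fit_offline(contexts, actions, rewards, n_actions,
                width=16, lr=0.01, lam=1.0, seed=0):
    # Stream the logged tuples once, taking one SGD step per sample.
    rng = np.random.default_rng(seed)
    dim = contexts.shape[1] * n_actions
    W1 = rng.normal(size=(width, dim))
    w2 = rng.normal(size=width)
    Lam = lam * np.eye(W1.size + w2.size)  # regularized gradient covariance
    for x, a, r in zip(contexts, actions, rewards):
        value, gW1, gw2 = predict_and_grad((W1, w2), encode(x, a, n_actions))
        g = np.concatenate([gW1.ravel(), gw2])
        Lam += np.outer(g, g)
        err = value - r                    # gradient of 0.5 * (f - r)^2 w.r.t. f
        W1 = W1 - lr * err * gW1
        w2 = w2 - lr * err * gw2
    return (W1, w2), Lam

def lcb_scores(params, Lam, x, n_actions, beta=1.0):
    # Pessimistic score: predicted reward minus a scaled uncertainty width.
    scores = []
    for a in range(n_actions):
        value, gW1, gw2 = predict_and_grad(params, encode(x, a, n_actions))
        g = np.concatenate([gW1.ravel(), gw2])
        bonus_sq = max(g @ np.linalg.solve(Lam, g), 0.0)
        scores.append(value - beta * np.sqrt(bonus_sq))
    return np.array(scores)

# Usage on synthetic logged data (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
A = rng.integers(0, 3, size=200)
R = rng.uniform(size=200)
params, Lam = fit_offline(X, A, R, n_actions=3)
chosen = int(np.argmax(lcb_scores(params, Lam, X[0], n_actions=3)))
```

The learned policy picks the action with the highest lower confidence bound, so actions that are poorly covered by the logged data are penalized rather than explored.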