In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, existing solutions can produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditional on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and their performance can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning to optimize the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus does not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis models to Bayesian neural network models. We develop a computational algorithm based on variational inference that is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms existing state-of-the-art solutions in both simulations and a real data example.
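To make the pessimism principle concrete, the following is a minimal sketch (not the paper's algorithm) of pessimistic action selection from posterior draws of Q(s, a): actions are ranked by a lower credible bound rather than the posterior mean, so poorly explored actions with wide posteriors are discouraged. The function name, the Gaussian toy posterior, and the quantile-based bound are illustrative assumptions; the article instead derives a credible set whose boundary uniformly lower bounds the optimal Q-function, removing the need to tune the pessimism level.

```python
# Illustrative sketch only: pessimistic action selection from posterior Q draws.
# The Gaussian draws and the quantile-based lower bound are assumptions for
# illustration, not the method proposed in the article.
import numpy as np

rng = np.random.default_rng(0)

def pessimistic_action(q_draws, alpha=0.05):
    """q_draws: array of shape (num_posterior_draws, num_actions) for one state.
    Returns the action maximizing a per-action lower credible bound of Q."""
    lower = np.quantile(q_draws, alpha, axis=0)  # lower bound for each action
    return int(np.argmax(lower))

# Toy example: the second action has a higher posterior mean but far more
# uncertainty (e.g., rarely observed offline), so pessimism prefers the first.
draws = np.column_stack([
    rng.normal(1.0, 0.1, size=2000),   # well-explored action
    rng.normal(1.2, 1.5, size=2000),   # poorly explored action
])
print(pessimistic_action(draws))  # -> 0
```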