In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, existing solutions can produce sub-optimal policies. The pessimism principle addresses this issue by discouraging the recommendation of actions that are less explored conditional on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and their performance can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning to optimize the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, so no additional tuning of the degree of pessimism is required. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis models to Bayesian neural network models. We develop a computational algorithm based on variational inference, which is highly efficient and scalable. We establish theoretical guarantees for the proposed method, and show empirically that it outperforms existing state-of-the-art solutions in both simulations and a real data example.
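To make the pessimism-via-credible-set idea concrete, the following is a minimal illustrative sketch, not the exact construction in the paper. It assumes posterior draws of the Q-function at a fixed state are already available, for instance from a Bayesian linear basis or Bayesian neural network model fitted by variational inference; the array `q_draws`, the level `alpha`, and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: q_draws[m, a] is the m-th posterior draw of Q(s, a)
# at a fixed state s, e.g. obtained by Thompson-sampling the parameters of
# a Bayesian Q-function model (linear basis or neural network).
n_draws, n_actions = 500, 3
q_draws = rng.normal(loc=[1.0, 1.2, 0.8], scale=[0.1, 0.6, 0.2],
                     size=(n_draws, n_actions))

# Pessimistic estimate: the lower boundary of a (1 - alpha) credible set.
alpha = 0.05
q_pessimistic = np.quantile(q_draws, alpha, axis=0)

# Recommended action under the pessimism principle.
best_action = int(np.argmax(q_pessimistic))
print(q_pessimistic, best_action)
```

In this sketch, actions with wide posteriors (i.e., those less supported by the offline data) receive a lower pessimistic value, so the recommendation naturally avoids under-explored actions, and the degree of pessimism is controlled by the posterior uncertainty rather than a separately tuned hyper-parameter.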