Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that popular infinite-support stochastic policies, e.g., the Gaussian policy, may assign dangerously high dosages and seriously harm patients. Hence, it is important to induce a policy class whose support contains only near-optimal actions, shrinking the action search space for both effectiveness and reliability. To achieve this, we develop a novel \emph{quasi-optimal learning algorithm}, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a real-data application to dose suggestion in the Ohio Type 1 diabetes dataset.