Did we personalize? Assessing personalization by an online reinforcement learning algorithm using resampling

There is growing interest in using reinforcement learning (RL) to personalize sequences of treatments in digital health to support users in adopting healthier behaviors. Such sequential decision-making problems involve decisions about when to treat and how to treat based on the user's context (e.g., prior activity level, location, etc.). Online RL is a promising data-driven approach for this problem because it learns from each user's historical responses and uses that knowledge to personalize these decisions. However, to decide whether the RL algorithm should be included in an "optimized" intervention for real-world deployment, we must assess the evidence in the data that the RL algorithm is actually personalizing the treatments to its users. Because the RL algorithm is stochastic, one may get the false impression that it is learning in certain states and using this learning to provide specific treatments. We use a working definition of personalization and introduce a resampling-based methodology for investigating whether the personalization exhibited by the RL algorithm is an artifact of the algorithm's stochasticity. We illustrate our methodology with a case study, analyzing data from a physical activity clinical trial called HeartSteps, which included the use of an online RL algorithm. We demonstrate how our approach enhances data-driven truth-in-advertising of algorithm personalization, both across all users and within specific users in the study.
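The resampling idea can be illustrated with a toy sketch. It is not the HeartSteps algorithm or the authors' procedure; every name and modeling choice below is an assumption made for illustration: a binary state, a binary treatment, a per-state Beta-Bernoulli Thompson sampler standing in for the online RL algorithm, and a "personalization statistic" defined as the absolute difference between states in how often treatment was chosen during the second half of the study. The sketch reruns the stochastic algorithm many times in a simulated environment where treatment has no effect in either state, then reports how often chance alone produces a statistic at least as large as the one observed.

```python
import random


def run_bandit(effects, n_steps, rng):
    """Run a toy per-state Thompson sampler and return a personalization statistic.

    effects[s] is the (hypothetical) increase in reward probability from
    treating in state s. The statistic is the absolute difference between
    states in the fraction of second-half decisions where treatment was chosen.
    """
    # post[state][action] = [alpha, beta] of a Beta posterior, starting at Beta(1, 1).
    post = [[[1, 1], [1, 1]] for _ in range(2)]
    treated = [0, 0]
    visits = [0, 0]
    for t in range(n_steps):
        state = rng.randrange(2)
        # Thompson sampling: draw once from each action's posterior, pick the larger.
        draws = [rng.betavariate(a, b) for a, b in post[state]]
        action = 1 if draws[1] > draws[0] else 0
        p_reward = 0.5 + (effects[state] if action == 1 else 0.0)
        reward = 1 if rng.random() < p_reward else 0
        # A success increments alpha (index 0), a failure increments beta (index 1).
        post[state][action][0 if reward else 1] += 1
        if t >= n_steps // 2:
            visits[state] += 1
            treated[state] += action
    fracs = [treated[s] / max(visits[s], 1) for s in range(2)]
    return abs(fracs[1] - fracs[0])


def resampled_fraction(observed, n_resamples, n_steps, seed=0):
    """Fraction of no-effect resampled runs whose statistic is >= observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Null environment: treatment has no effect in either state.
        if run_bandit((0.0, 0.0), n_steps, rng) >= observed:
            hits += 1
    return hits / n_resamples


if __name__ == "__main__":
    # "Observed" run: treatment helps only in state 1 (an assumed effect size).
    observed = run_bandit((0.0, 0.3), n_steps=400, rng=random.Random(1))
    frac = resampled_fraction(observed, n_resamples=500, n_steps=400)
    print(f"observed statistic: {observed:.3f}")
    print(f"fraction of no-effect runs at least as large: {frac:.3f}")
```

A small fraction suggests that the apparent personalization is unlikely to be an artifact of algorithm stochasticity alone; a large fraction suggests that similar patterns arise even when there is nothing to learn.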
