Bandit algorithms are widely used in sequential decision problems to maximize cumulative reward. One potential application is mobile health, where the goal is to promote the user's health through personalized interventions based on user-specific information acquired through wearable devices. Important considerations include the type of data collected and the frequency of collection (e.g., GPS or continuous monitoring), as such factors can severely impact app performance and user adherence. To balance the need to collect useful data against the constraint of limiting the impact on app performance, one needs to be able to assess the usefulness of variables. Bandit feedback data are sequentially correlated, so traditional testing procedures developed for independent data do not apply. Recently, a statistical testing procedure was developed for the actor-critic bandit algorithm. An actor-critic algorithm maintains two separate models: one for the actor, the action-selection policy, and one for the critic, the reward model. The performance of the algorithm, as well as the validity of the test, is guaranteed only when the critic model is correctly specified. However, misspecification is frequent in practice due to an incorrect functional form or missing covariates. In this work, we propose a modified actor-critic algorithm that is robust to critic misspecification and derive a novel testing procedure for the actor parameters in this setting.
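As an illustrative sketch of the actor-critic structure described above (the notation and functional forms here are assumptions for exposition, not taken from the paper): with context $S_t$ and binary action $A_t \in \{0,1\}$, the critic posits a working model for the reward and the actor selects actions through a parametric policy,
$$
\mathbb{E}[R_t \mid S_t = s, A_t = a] \;\approx\; g(s,a)^\top \beta,
\qquad
\pi_\theta(A_t = 1 \mid S_t = s) \;=\; \frac{\exp\{f(s)^\top \theta\}}{1 + \exp\{f(s)^\top \theta\}},
$$
where $g$ and $f$ are feature maps, $\beta$ are the critic parameters, and $\theta$ are the actor parameters on which the hypothesis test is performed. The concern addressed in this work is that guarantees for both the algorithm and the test typically require the critic model $g(s,a)^\top \beta$ to be correctly specified.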