We introduce Dynamic Contextual Markov Decision Processes (DCMDPs), a novel reinforcement learning framework for history-dependent environments that generalizes the contextual MDP framework to handle non-Markov environments, where contexts change over time. We consider special cases of the model, with a focus on logistic DCMDPs, which break the exponential dependence on history length by leveraging aggregation functions to determine context transitions. This special structure allows us to derive an upper-confidence-bound-style algorithm, for which we establish regret bounds. Motivated by our theoretical results, we introduce a practical model-based algorithm for logistic DCMDPs that plans in a latent space and uses optimism over history-dependent features. We demonstrate the efficacy of our approach on a recommendation task (using MovieLens data), where user behavior dynamics evolve in response to recommendations.
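To make the abstract's central claim concrete, the following is a minimal sketch of what an aggregation-based logistic context transition might look like; the symbols $\theta_c$, $\phi$, and $\mathrm{agg}$ are illustrative placeholders, not the paper's exact notation. The next context is drawn from a softmax over linear functions of an aggregated history statistic:
$$
\Pr\big(c_{t+1} = c \mid h_t\big) \;=\; \frac{\exp\!\big(\theta_c^\top \mathrm{agg}(h_t)\big)}{\sum_{c'} \exp\!\big(\theta_{c'}^\top \mathrm{agg}(h_t)\big)},
\qquad
\mathrm{agg}(h_t) \;=\; \sum_{i=1}^{t} \phi(s_i, a_i, c_i).
$$
Under this kind of structure, the transition depends on the history $h_t$ only through a fixed-dimensional aggregate $\mathrm{agg}(h_t)$ rather than the full trajectory, which is how the exponential dependence on history length can be avoided.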