In real-world applications of reinforcement learning, it is often challenging to obtain a state representation that is parsimonious and satisfies the Markov property without prior knowledge. Consequently, it is common practice to construct a state that is larger than necessary, e.g., by concatenating measurements over contiguous time points. However, needlessly increasing the dimension of the state can slow learning and obfuscate the learned policy. We introduce the notion of a minimal sufficient state in a Markov decision process (MDP) as the smallest subvector of the original state under which the process remains an MDP and shares the same optimal policy as the original process. We propose a novel sequential knockoffs (SEEK) algorithm that estimates the minimal sufficient state in a system with high-dimensional, complex, nonlinear dynamics. In large samples, the proposed method controls the false discovery rate and selects all sufficient variables with probability approaching one. As the method is agnostic to the reinforcement learning algorithm being applied, it benefits downstream tasks such as policy optimization. Empirical experiments verify the theoretical results and show that the proposed approach outperforms several competing methods in terms of variable selection accuracy and regret.
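For intuition, the sketch below illustrates the model-X knockoff filter that knockoff-based selection builds on, applied once to toy state–reward data: Gaussian knockoff copies of the state coordinates are generated, a lasso is fit on the augmented design, and the knockoff+ threshold selects coordinates while controlling the false discovery rate at level q. This is only a simplified illustration under Gaussian assumptions, not the SEEK algorithm itself; the functions `gaussian_knockoffs` and `knockoff_select` and the toy reward model are illustrative constructions, not part of the paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

def gaussian_knockoffs(X, rng):
    """Equi-correlated Gaussian model-X knockoffs (simplified sketch)."""
    n, p = X.shape
    X = (X - X.mean(0)) / X.std(0)                 # standardize columns
    Sigma = np.corrcoef(X, rowvar=False)
    lam_min = np.linalg.eigvalsh(Sigma).min()
    s = np.full(p, min(1.0, 2.0 * lam_min) * 0.95)  # shrink for numerical safety
    Sigma_inv = np.linalg.inv(Sigma)
    D = np.diag(s)
    mean = X - X @ Sigma_inv @ D                   # E[X_knockoff | X]
    V = 2.0 * D - D @ Sigma_inv @ D                # Cov[X_knockoff | X]
    L = np.linalg.cholesky(V + 1e-8 * np.eye(p))
    return X, mean + rng.standard_normal((n, p)) @ L.T

def knockoff_select(X, y, q=0.1, rng=rng):
    """Select variables with FDR <= q via the knockoff+ filter."""
    X, X_tilde = gaussian_knockoffs(X, rng)
    p = X.shape[1]
    beta = LassoCV(cv=5).fit(np.hstack([X, X_tilde]), y).coef_
    W = np.abs(beta[:p]) - np.abs(beta[p:])        # feature statistics
    for t in np.sort(np.abs(W[W != 0])):           # knockoff+ threshold
        fdp = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)

# Toy data: only the first 3 of 20 state coordinates drive the reward.
n, p = 2000, 20
states = rng.standard_normal((n, p))
rewards = states[:, :3] @ np.array([1.0, -2.0, 1.5]) + 0.5 * rng.standard_normal(n)
print(knockoff_select(states, rewards, q=0.1))
```

On this toy example the procedure typically recovers the three reward-relevant coordinates while holding the false discovery rate at the target level; a sequential procedure in the spirit of SEEK would apply such tests to the transition structure of the MDP rather than to a single regression.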