We study provably efficient reinforcement learning in non-episodic factored Markov decision processes (FMDPs). All previous regret minimization algorithms in this setting made the strong assumption that the factored structure of the FMDP is known to the learner in advance. In this paper, we provide the first algorithm that learns the structure of the FMDP while minimizing regret. Our algorithm is based on the optimism in the face of uncertainty principle, combined with a simple statistical method for structure learning, and can be implemented efficiently given oracle access to an FMDP planner. In addition, we give a variant of our algorithm that remains efficient even when the oracle is limited to non-factored actions, which is the case with almost all existing approximate planners. Finally, we also provide a novel lower bound for the known-structure case that matches the best known regret bound of Chen et al. (2020).