This paper tackles the problem of learning value functions from undirected state-only experience (state transitions without action labels, i.e., (s, s', r) tuples). We first theoretically characterize the applicability of Q-learning in this setting. We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space. This theoretical result motivates the design of Latent Action Q-learning (LAQ), an offline RL method that can learn effective value functions from state-only experience. LAQ learns value functions by running Q-learning on discrete latent actions obtained through a latent-variable future-prediction model. We show that LAQ can recover value functions that correlate highly with value functions learned using ground-truth actions. Value functions learned using LAQ lead to sample-efficient acquisition of goal-directed behavior, can be used with domain-specific low-level controllers, and facilitate transfer across embodiments. Our experiments in 5 environments, ranging from a 2D grid world to 3D visual navigation in realistic environments, demonstrate the benefits of LAQ over simpler alternatives, imitation learning oracles, and competing methods.
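The invariance claim above can be illustrated with a toy example. The sketch below runs tabular Q-value iteration (the fixed point that tabular Q-learning converges to) on a small random MDP, then "refines" the action space by duplicating one action with identical dynamics and reward; the resulting state-value function is unchanged. The MDP here is a hypothetical example for illustration, not one of the paper's environments.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, iters=500):
    """Tabular Q-value iteration: Q(s,a) = R(s,a) + gamma * E_{s'}[max_a' Q(s',a')].

    P has shape (S, A, S), where P[s, a] is a distribution over next states;
    R has shape (S, A).
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)          # greedy state values, shape (S,)
        Q = R + gamma * (P @ V)    # batched expectation over next states
    return Q

# Toy 3-state, 2-action MDP (hypothetical, fixed seed for reproducibility).
S, A = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))

V = q_value_iteration(P, R).max(axis=1)

# Refine the action space: add a duplicate of action 0
# (same transition dynamics, same reward).
P_ref = np.concatenate([P, P[:, :1]], axis=1)
R_ref = np.concatenate([R, R[:, :1]], axis=1)
V_ref = q_value_iteration(P_ref, R_ref).max(axis=1)

print(np.allclose(V, V_ref))  # the value function is unchanged by the refinement
```

Duplicating an action is the simplest refinement; the same argument applies to any partition of each action into latent sub-actions, which is why Q-learning over discovered latent actions can recover the original value function.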