Consider a prosthetic arm learning to adapt to its user's control signals. We propose Interaction-Grounded Learning for this novel setting, in which a learner's goal is to optimize its policy by interacting with the environment, with no grounding and no explicit reward. Such a problem evades common RL solutions, which require an explicit reward. The learning agent observes a multidimensional context vector, takes an action, and then observes a multidimensional feedback vector that carries no explicit reward information. To succeed, the algorithm must learn how to evaluate the feedback vector to discover a latent reward signal, with which it can ground its policy without supervision. We show that in an Interaction-Grounded Learning setting, under certain natural assumptions, a learner can discover the latent reward and ground its policy for successful interaction. We provide theoretical guarantees and a proof-of-concept empirical evaluation to demonstrate the effectiveness of our proposed approach.
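The interaction protocol described above (observe a context, take an action, observe a feedback vector that contains no explicit reward) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class `ToyIGLEnvironment`, the rule that decides which action is "correct", the Gaussian feedback model that depends only on the latent reward, and the uniformly random policy. It shows only the shape of the data the learner sees, not a learning algorithm.

```python
import random


class ToyIGLEnvironment:
    """Hypothetical toy environment: the reward exists but is never returned."""

    def __init__(self, n_actions=3, context_dim=4, feedback_dim=5, seed=0):
        if context_dim < n_actions:
            raise ValueError("context_dim must be at least n_actions")
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        self.context_dim = context_dim
        self.feedback_dim = feedback_dim
        self._context = None

    def observe_context(self):
        """Draw and return a multidimensional context vector."""
        self._context = [self.rng.gauss(0.0, 1.0) for _ in range(self.context_dim)]
        return list(self._context)

    def _latent_reward(self, action):
        # Assumed rule: the correct action is the index of the largest
        # of the first n_actions context coordinates.
        best = max(range(self.n_actions), key=lambda i: self._context[i])
        return 1 if action == best else 0

    def step(self, action):
        """Return a feedback vector only; the latent reward stays hidden."""
        if self._context is None:
            raise RuntimeError("call observe_context() before step()")
        reward = self._latent_reward(action)
        # Assumed feedback model: the feedback depends on the latent reward
        # alone, through the mean of an isotropic Gaussian.
        mean = 1.0 if reward == 1 else -1.0
        return [self.rng.gauss(mean, 0.5) for _ in range(self.feedback_dim)]


def run_round(env, policy):
    """One round of interaction: context, then action, then feedback."""
    context = env.observe_context()
    action = policy(context)
    feedback = env.step(action)
    return context, action, feedback


def make_uniform_policy(n_actions, seed=0):
    """A policy that ignores the context and picks an action at random."""
    rng = random.Random(seed)
    return lambda context: rng.randrange(n_actions)


env = ToyIGLEnvironment()
policy = make_uniform_policy(env.n_actions)
context, action, feedback = run_round(env, policy)
```

In this sketch the learner receives only the triple `(context, action, feedback)`. Any reward estimate would have to be decoded from `feedback`, which is the problem the abstract poses.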