When Learning Is Out of Reach, Reset: Generalization in Autonomous Visuomotor Reinforcement Learning

Episodic training, where an agent's environment is reset after every success or failure, is the de facto standard when training embodied reinforcement learning (RL) agents. The underlying assumption that the environment can be easily reset is limiting both practically, as resets generally require human effort in the real world and can be computationally expensive in simulation, and philosophically, as we'd expect intelligent agents to be able to continuously learn without intervention. Work in learning without any resets, i.e., Reset-Free RL (RF-RL), is promising but is plagued by the problem of irreversible transitions (e.g., an object breaking) which halt learning. Moreover, the limited state diversity and instrument setup encountered during RF-RL mean that works studying RF-RL largely do not require their models to generalize to new environments. In this work, we instead look to minimize, rather than completely eliminate, resets while building visual agents that can meaningfully generalize. As studying generalization has previously not been a focus of benchmarks designed for RF-RL, we propose a new Stretch Pick-and-Place benchmark designed for evaluating generalization across goals, cosmetic variations, and structural changes. Moreover, towards building performant reset-minimizing RL agents, we propose unsupervised metrics to detect irreversible transitions and a single-policy training mechanism to enable generalization. Our proposed approach significantly outperforms prior episodic, reset-free, and reset-minimizing approaches, achieving higher success rates with fewer resets in Stretch-P&P and another popular RF-RL benchmark. Finally, we find that our proposed approach can dramatically reduce the number of resets required for training other embodied tasks; in particular, for RoboTHOR ObjectNav we obtain higher success rates than episodic approaches using 99.97% fewer resets.

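Translation: Conventional reinforcement learning trains episodically, resetting the environment after each success or failure. Such resets are hard to provide: they require human intervention in the real world and consume compute in simulation, and relying on them limits an agent's ability to learn continuously. Reset-free RL (RF-RL) trains without resets but is hampered by irreversible transitions (e.g., an object breaking), and the limited state diversity and instrument setup encountered during RF-RL training mean that models are rarely asked to generalize to new environments. This work instead minimizes resets while building visual agents that generalize meaningfully. Because existing RF-RL benchmarks were not designed to study generalization, we design a new Stretch Pick-and-Place benchmark that evaluates generalization across goals, cosmetic variations, and structural changes. We also propose unsupervised metrics for detecting irreversible transitions and a single-policy training mechanism that enables generalization. Experiments show that the proposed approach significantly outperforms prior approaches on Stretch-P&P and another popular RF-RL benchmark, achieving higher success rates with fewer resets. Finally, the approach greatly reduces the number of resets needed to train other embodied tasks: on RoboTHOR ObjectNav it obtains higher success rates than episodic training while using 99.97% fewer resets.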