Behavioural cloning is an imitation learning technique that teaches an agent how to behave from expert demonstrations. Recent approaches use self-supervision on fully observable, unlabelled snapshots of the states to decode state pairs into actions. However, the iterative learning scheme employed by these techniques is prone to getting trapped in poor local minima. Previous work uses goal-aware strategies to address this issue, but doing so requires manual intervention to verify whether an agent has reached its goal. We address this limitation by incorporating a discriminator into the original framework, which offers two key advantages and directly resolves a learning problem present in previous work. First, it removes the need for manual intervention. Second, it aids learning by guiding function approximation based on the state transitions of the expert's trajectories. Third, the discriminator resolves a learning issue common in the policy model, namely its tendency to repeatedly perform a `no action' within the environment until the agent finally halts.
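To make the discriminator's role concrete, below is a minimal sketch of one plausible realisation, not the paper's actual implementation: a binary classifier over state-transition pairs $(s, s')$, trained to label expert transitions as real and agent transitions as fake. All names (`TransitionDiscriminator`, `STATE_DIM`, `HIDDEN`) and the PyTorch setup are assumptions for illustration. A discriminator of this form could stand in for a manual goal check and, since a `no action' yields $s = s'$ (a transition rarely seen in expert data), score such degenerate transitions poorly.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
STATE_DIM, HIDDEN = 64, 128

class TransitionDiscriminator(nn.Module):
    """Scores a state transition (s, s') as expert-like (high logit) or not."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * STATE_DIM, HIDDEN),  # input: concatenated (s, s')
            nn.ReLU(),
            nn.Linear(HIDDEN, 1),              # output: "expert-like" logit
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))

disc = TransitionDiscriminator()
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(disc.parameters(), lr=1e-4)

def discriminator_step(expert_s, expert_s_next, agent_s, agent_s_next):
    """One update: expert transitions labelled 1, agent transitions labelled 0.
    The trained scores can then guide the policy's function approximation
    and penalise degenerate transitions where s == s' (a `no action')."""
    expert_logits = disc(expert_s, expert_s_next)
    agent_logits = disc(agent_s, agent_s_next)
    loss = (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(agent_logits, torch.zeros_like(agent_logits)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```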