Most existing policy learning methods require the learning agent to receive high-quality supervision signals, such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such supervision is usually infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages cheap, readily available weak supervision to perform policy learning efficiently. To this end, we treat the weak supervision as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy, rather than on simple agreement. Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high.
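To make the contrast between simple agreement and correlated agreement concrete, here is a minimal sketch for discrete actions. It is an illustration under assumptions, not the implementation evaluated in the paper: the function name `correlated_agreement`, the weight `xi`, and the toy data are hypothetical. The score rewards agreement with the peer on the same state and subtracts the agreement expected when the policy's actions are paired with the peer's actions on unrelated states.

```python
import numpy as np

def correlated_agreement(policy_actions, peer_actions, xi=1.0):
    """Hypothetical sketch of a correlated-agreement score.

    policy_actions[i] and peer_actions[i] are the discrete actions that the
    learning agent and the weak supervisor (peer) pick on the same state i.
    """
    policy_actions = np.asarray(policy_actions)
    peer_actions = np.asarray(peer_actions)

    # Simple agreement: fraction of states where both pick the same action.
    matched = np.mean(policy_actions == peer_actions)

    # Baseline agreement: average over all (i, j) pairs, i.e. policy action
    # on state i versus peer action on state j. For simplicity the matched
    # pairs (i == j) are included in this average.
    mismatched = np.mean(policy_actions[:, None] == peer_actions[None, :])

    # Assumption: xi weights the penalty term; xi = 0 gives simple agreement.
    return float(matched - xi * mismatched)

# Toy data: the peer picks action 0 on 9 of 10 states.
peer = np.array([0] * 9 + [1])

# A policy that ignores the state and always outputs the majority action.
constant_policy = np.zeros(10, dtype=int)
constant_simple = float(np.mean(constant_policy == peer))
constant_score = correlated_agreement(constant_policy, peer)

# A policy that tracks a balanced, state-dependent peer signal.
balanced_peer = np.array([0, 1] * 5)
informed_score = correlated_agreement(balanced_peer.copy(), balanced_peer)
```

In this toy setting the constant policy has a high simple agreement with the peer but a correlated-agreement score of about zero, because all of its agreement is explained by the peer's action frequencies alone. The state-dependent policy keeps a clearly positive score.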