Appropriate credit assignment for delayed rewards is a fundamental challenge in reinforcement learning. To tackle this problem, we introduce a delayed reward calibration paradigm inspired by a classification perspective. We hypothesize that well-represented state vectors share similarities with one another, since they contain the same or equivalent essential information. To this end, we define an empirical sufficient distribution, in which the state vectors lead agents to environmental reward signals in the consequent steps. Accordingly, a purify-trained classifier is designed to obtain this distribution and generate the calibrated rewards. We examine the correctness of sufficient state extraction by tracking the extraction in real time and by building different reward functions in the environments. The results demonstrate that the classifier generates timely and accurate calibrated rewards, and that these rewards make the model training process more efficient. Finally, we identify and discuss how the sufficient states extracted by our model resonate with human observations.
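As a rough illustration of the idea rather than the paper's exact formulation, the sketch below labels state vectors by whether an environmental reward arrives within the next few steps, trains a binary classifier on those labels, and reads its prediction back out as a dense calibrated reward. The names `SufficiencyClassifier`, `label_window`, and `calibrated_reward`, and the simple positive-labeling rule, are illustrative assumptions, not details taken from the abstract.

```python
# Minimal sketch of classification-based delayed reward calibration.
# Assumption: a state is "sufficient" if an environmental reward occurs
# within `label_window` subsequent steps; the classifier's probability for
# that event is used as a calibrated per-step reward.
import torch
import torch.nn as nn


class SufficiencyClassifier(nn.Module):
    """Predicts whether a state vector lies in the empirical sufficient
    distribution, i.e. leads to a reward in the consequent steps."""

    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        return self.net(states).squeeze(-1)  # logits


def label_states(rewards: torch.Tensor, label_window: int = 5) -> torch.Tensor:
    """Label a step positive if any environmental reward arrives within the
    next `label_window` steps (illustrative labeling rule)."""
    T = rewards.shape[0]
    labels = torch.zeros(T)
    for t in range(T):
        if rewards[t:t + label_window].abs().sum() > 0:
            labels[t] = 1.0
    return labels


def calibrated_reward(clf: SufficiencyClassifier, states: torch.Tensor) -> torch.Tensor:
    """Dense calibrated reward: the classifier's probability that each state
    belongs to the empirical sufficient distribution."""
    with torch.no_grad():
        return torch.sigmoid(clf(states))


def train_step(clf, optimizer, states, rewards):
    """One classifier update on a collected trajectory of states and rewards."""
    labels = label_states(rewards)
    loss = nn.functional.binary_cross_entropy_with_logits(clf(states), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```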