Can artificial agents learn to assist others in achieving their goals without knowing what those goals are? Generic reinforcement learning agents could be trained to behave altruistically towards others by rewarding them for altruistic behaviour, i.e., for benefiting other agents in a given situation. Such an approach assumes that other agents' goals are known so that the altruistic agent can cooperate in achieving those goals. However, explicit knowledge of other agents' goals is often difficult to acquire. Even if such knowledge were given, training altruistic agents would require manually tuned external rewards for each new environment. It is therefore beneficial to develop agents that do not depend on external supervision and can learn altruistic behaviour in a task-agnostic manner. Assuming that other agents rationally pursue their goals, we hypothesize that giving them more choices will allow them to pursue those goals more effectively. Concrete examples include opening a door for others or safeguarding them so they can pursue their objectives without interference. We formalize this concept and propose an altruistic agent that learns to increase the choices another agent has by maximizing the number of states that the other agent can reach in its future. We evaluate our approach in three different multi-agent environments where another agent's success depends on the altruistic agent's behaviour. Finally, we show that our unsupervised agents can perform comparably to agents explicitly trained to work cooperatively, and in some cases even outperform the supervised ones.
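To make the core idea concrete, the sketch below illustrates one simple way such a choice-based intrinsic reward could be computed: counting the distinct states the other agent can reach within a short horizon and using that count as the altruistic agent's reward. This is only an illustrative sketch under strong assumptions (a known, deterministic, discrete-state transition model `transition_fn` and a hand-picked horizon), not the exact estimator used in the paper.

```python
from collections import deque

def reachable_states(state, transition_fn, actions, horizon):
    """Count distinct states reachable from `state` within `horizon` steps.

    Assumes a deterministic, discrete model of the environment from the
    other agent's perspective: transition_fn(state, action) -> next_state.
    """
    frontier = deque([(state, 0)])
    visited = {state}
    while frontier:
        s, depth = frontier.popleft()
        if depth == horizon:
            continue
        for a in actions:
            s_next = transition_fn(s, a)
            if s_next not in visited:
                visited.add(s_next)
                frontier.append((s_next, depth + 1))
    return len(visited)

def altruistic_reward(other_agent_state, transition_fn, actions, horizon=5):
    """Hypothetical intrinsic reward for the altruistic agent: the other
    agent's 'choice', measured as the number of states it can reach in its
    near future. A larger count means the altruistic agent has left the
    other agent with more options."""
    return reachable_states(other_agent_state, transition_fn, actions, horizon)
```

In practice the transition model would typically have to be learned or estimated from experience, and the reachable-state count could be replaced by a smoother proxy for choice; the breadth-first search here simply makes the "more reachable states means more choice" intuition explicit.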