We introduce Synthetic Environments (SEs) and Reward Networks (RNs), represented by neural networks, as proxy environment models for training Reinforcement Learning (RL) agents. We show that an agent, after being trained exclusively on the SE, is able to solve the corresponding real environment. While an SE acts as a full proxy to a real environment by learning its state dynamics and rewards, an RN is a partial proxy that learns to augment or replace rewards. We use bi-level optimization to evolve SEs and RNs: the inner loop trains the RL agent, and the outer loop trains the parameters of the SE/RN via an evolution strategy. We evaluate our proposed concepts on a broad range of RL algorithms and classic control environments. In a one-to-one comparison, learning an SE proxy requires more interactions with the real environment than training agents only on the real environment. However, once such an SE has been learned, we do not need any interactions with the real environment to train new agents. Moreover, the learned SE proxies allow us to train agents with fewer interactions while maintaining the original task performance. Our empirical results suggest that SEs achieve this by learning informed representations that bias the agents towards relevant states. Finally, we find that these proxies are robust against hyperparameter variation and can also transfer to unseen agents.
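To make the bi-level scheme concrete, the following minimal sketch illustrates how an evolution-strategy outer loop over the SE parameters could wrap an inner RL training loop. It assumes a simple NES-style centered-rank update and two illustrative callables, train_agent_in_se and evaluate_on_real_env; these names and the specific update rule are our assumptions for exposition, not the authors' exact implementation.

```python
import numpy as np

def evolve_synthetic_env(
    train_agent_in_se,     # callable: psi -> agent trained only inside the SE (inner loop)
    evaluate_on_real_env,  # callable: agent -> mean return on the real environment
    dim_psi,               # number of (flattened) SE parameters
    population_size=16,
    sigma=0.1,
    lr=0.01,
    outer_steps=100,
):
    """NES-style outer loop over SE parameters (a sketch, not the paper's exact ES)."""
    psi = np.zeros(dim_psi)  # current SE parameters
    for _ in range(outer_steps):
        # Sample a population of perturbed SE parameter vectors.
        noise = np.random.randn(population_size, dim_psi)
        scores = np.empty(population_size)
        for i in range(population_size):
            candidate = psi + sigma * noise[i]
            agent = train_agent_in_se(candidate)      # inner loop: RL on the SE only
            scores[i] = evaluate_on_real_env(agent)   # outer objective: real-env return
        # Centered-rank normalization, then a gradient-ascent step on psi.
        ranks = scores.argsort().argsort()
        weights = ranks / (population_size - 1) - 0.5
        psi += lr / (population_size * sigma) * noise.T @ weights
    return psi
```

In this sketch, the real environment is queried only to score candidates (and the inner loop trains exclusively on the SE), which mirrors the trade-off described above: learning the proxy costs real interactions up front, but once psi is fixed, new agents can be trained via train_agent_in_se(psi) without touching the real environment.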