We present a new class of structured reinforcement learning policy architectures, Implicit Two-Tower (ITT) policies, where actions are chosen based on the attention scores between their learnable latent representations and those of the input states. By explicitly disentangling action processing from state processing in the policy stack, we achieve two main goals: substantial computational gains and better performance. Our architectures are compatible with both discrete and continuous action spaces. By conducting tests on 15 environments from OpenAI Gym and DeepMind Control Suite, we show that ITT architectures are particularly suited for blackbox/evolutionary optimization and that the corresponding policy training algorithms outperform their vanilla unstructured implicit counterparts as well as commonly used explicit policies. We complement our analysis by showing how techniques such as hashing and lazy tower updates, which rely critically on the two-tower structure of ITTs, can be applied to obtain additional computational improvements.
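To make the scoring mechanism concrete, the following is a minimal sketch of an ITT policy for a discrete action space, assuming a linear state tower and dot-product attention scores; the class and parameter names (ITTPolicy, W_state, action_latents) are illustrative assumptions, not the paper's implementation.

import numpy as np

class ITTPolicy:
    """Sketch of an Implicit Two-Tower policy: a state tower embeds the
    input state, an action tower holds learnable latent representations
    of the actions, and the action with the highest attention score wins."""

    def __init__(self, state_dim, num_actions, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        # State tower: here a single linear map from states to latents.
        self.W_state = rng.normal(0.0, 0.1, (latent_dim, state_dim))
        # Action tower: one learnable latent vector per action. Because
        # actions are processed independently of states, these embeddings
        # can be cached between steps, which is what makes hashing and
        # lazy tower updates possible.
        self.action_latents = rng.normal(0.0, 0.1, (num_actions, latent_dim))

    def act(self, state):
        z_state = self.W_state @ state          # state-tower embedding
        scores = self.action_latents @ z_state  # attention scores per action
        return int(np.argmax(scores))           # pick the highest-scoring action

policy = ITTPolicy(state_dim=4, num_actions=3, latent_dim=8)
print(policy.act(np.ones(4)))

Since the policy's trainable parameters live in two flat arrays, this structure pairs naturally with blackbox/evolutionary optimizers, which perturb the parameter vector directly rather than backpropagating through the towers.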