Since their first appearance, transformers have been used successfully in wide-ranging domains, from computer vision to natural language processing. Applying transformers to reinforcement learning by reformulating it as a sequence modelling problem was proposed only recently. Compared to other commonly studied reinforcement learning problems, the Rubik's cube poses a unique set of challenges: it has a single solved state among quintillions of possible configurations, which leads to extremely sparse rewards. The proposed model, CubeTR, attends to longer sequences of actions and addresses the problem of sparse rewards. CubeTR learns to solve the Rubik's cube from arbitrary starting states without any human prior, and after move regularisation, the lengths of the solutions it generates are expected to be very close to those produced by the algorithms used by expert human solvers. CubeTR provides insights into the generalisability of learning algorithms to higher-dimensional cubes and the applicability of transformers to other relevant sparse reward scenarios.
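To make the "reinforcement learning as sequence modelling" framing concrete, the sketch below shows one minimal way a cube-solving trajectory could be fed to a causal transformer: each timestep is encoded as (return-to-go, state, action) tokens and the model predicts the next move. This is an illustrative assumption in the style of decision transformers, not the paper's actual architecture; all names and dimensions (CubeSequenceModel, N_MOVES, STATE_DIM) are hypothetical.

```python
# Hypothetical sketch (not the authors' code): cube solving as sequence
# modelling. A trajectory is interleaved (return-to-go, state, action)
# tokens; a causal transformer predicts the next move at each step.
import torch
import torch.nn as nn

N_MOVES = 12          # quarter-turn moves of a 3x3x3 cube (assumption)
STATE_DIM = 54 * 6    # flattened one-hot sticker colours (assumption)
D_MODEL = 128

class CubeSequenceModel(nn.Module):
    def __init__(self, max_len=64):
        super().__init__()
        self.embed_rtg = nn.Linear(1, D_MODEL)                   # return-to-go token
        self.embed_state = nn.Linear(STATE_DIM, D_MODEL)         # cube state token
        self.embed_action = nn.Embedding(N_MOVES + 1, D_MODEL)   # move token (+1 pad)
        self.pos = nn.Embedding(3 * max_len, D_MODEL)            # position of each token
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, N_MOVES)                  # next-move logits

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, STATE_DIM), actions: (B, T) ints
        B, T, _ = states.shape
        # interleave tokens per timestep as [rtg_t, state_t, action_t]
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states),
             self.embed_action(actions)], dim=2).reshape(B, 3 * T, D_MODEL)
        tokens = tokens + self.pos(torch.arange(3 * T, device=tokens.device))
        # causal mask so each token only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T).to(tokens.device)
        h = self.encoder(tokens, mask=mask)
        # predict the next move from each state-token position
        return self.head(h[:, 1::3, :])

model = CubeSequenceModel()
logits = model(torch.zeros(2, 8, 1), torch.zeros(2, 8, STATE_DIM),
               torch.zeros(2, 8, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 8, 12])
```

Conditioning on the return-to-go is one way such a model could cope with sparse rewards: the desired outcome (reaching the single solved state) is supplied as part of the input sequence rather than discovered through reward signals alone.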