Contrastive Reinforcement Learning of Symbolic Reasoning Domains

Abstract symbolic reasoning, as required in domains such as mathematics and logic, is a key component of human intelligence. Solvers for these domains have important applications, especially to computer-assisted education. But learning to solve symbolic problems is challenging for machine learning algorithms. Existing models either learn from human solutions or use hand-engineered features, making them expensive to apply in new domains. In this paper, we instead consider symbolic domains as simple environments where states and actions are given as unstructured text, and binary rewards indicate whether a problem is solved. This flexible setup makes it easy to specify new domains, but search and planning become challenging. We introduce four environments inspired by the Mathematics Common Core Curriculum, and observe that existing Reinforcement Learning baselines perform poorly. We then present a novel learning algorithm, Contrastive Policy Learning (ConPoLe), that explicitly optimizes the InfoNCE loss, which lower bounds the mutual information between the current state and next states that continue on a path to the solution. ConPoLe successfully solves all four domains. Moreover, problem representations learned by ConPoLe enable accurate prediction of the categories of problems in a real mathematics curriculum. Our results suggest new directions for reinforcement learning in symbolic domains, as well as applications to mathematics education.
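To make the InfoNCE objective mentioned in the abstract concrete, the following is a minimal sketch of the loss for a single state, not the paper's implementation. It assumes each candidate next state has already been assigned a real-valued compatibility score with the current state (in practice such scores would come from a learned model, which is not shown here), and that exactly one candidate, the "positive", lies on a path to the solution. The function names `info_nce_loss` and `mutual_information_lower_bound` are hypothetical and chosen for illustration only.

```python
import math
from typing import Sequence


def info_nce_loss(scores: Sequence[float], positive_index: int) -> float:
    """InfoNCE loss for one state.

    `scores` holds one compatibility score per candidate next state
    (assumed to be given; a learned model would normally produce them).
    The loss is the negative log-softmax of the positive candidate's score.
    """
    # Log-sum-exp with max subtraction for numerical stability.
    max_score = max(scores)
    log_normalizer = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_normalizer - scores[positive_index]


def mutual_information_lower_bound(mean_loss: float, num_candidates: int) -> float:
    """Standard InfoNCE bound: I(state; next state) >= log(N) - mean loss."""
    return math.log(num_candidates) - mean_loss


if __name__ == "__main__":
    # Hypothetical scores for four candidate next states; index 2 is the positive.
    example_scores = [0.1, -1.3, 2.5, 0.4]
    loss = info_nce_loss(example_scores, positive_index=2)
    print(f"loss = {loss:.4f}")
    print(f"MI lower bound = {mutual_information_lower_bound(loss, 4):.4f}")
```

With uniform scores the loss equals log(N) and the bound is zero, which matches the intuition that a model unable to distinguish the positive candidate carries no information about it. Lowering the loss raises the bound, which is the sense in which minimizing InfoNCE maximizes a lower bound on mutual information.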
