Modularity in Reinforcement Learning via Algorithmic Independence in Credit Assignment

From arXiv. Long Presentation at the Thirty-eighth International Conference on Machine Learning (ICML) 2021. 21 pages, 11 figures. v2: updated acknowledgments. v3: clarified that the internal function nodes of the credit assignment mechanism are not considered O(1).

Many transfer problems require re-using previously optimal decisions for solving new tasks, which suggests the need for learning algorithms that can modify the mechanisms for choosing certain actions independently of those for choosing others. However, there is currently neither a formalism nor a theory for how to achieve this kind of modular credit assignment. To address this gap, we define modular credit assignment as a constraint on minimizing the algorithmic mutual information among feedback signals for different decisions. We introduce what we call the modularity criterion for testing whether a learning algorithm satisfies this constraint by performing causal analysis on the algorithm itself. We generalize the recently proposed societal decision-making framework as a more granular formalism than the Markov decision process to prove that for decision sequences that do not contain cycles, certain single-step temporal difference action-value methods meet this criterion while all policy-gradient methods do not. Empirical evidence suggests that such action-value methods are more sample efficient than policy-gradient methods on transfer problems that require only sparse changes to a sequence of previously optimal decisions.
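The contrast between single-step temporal difference feedback and policy-gradient feedback can be illustrated with a toy sketch. This is not the paper's modularity criterion, which is stated in terms of algorithmic mutual information and causal analysis of the learning algorithm. It only shows, under assumed settings (a hypothetical three-step acyclic chain, a zero-initialized Q-table, a single update with the Q-table held fixed), which per-decision feedback signals change when only the last reward changes. The function names `td_feedback` and `mc_return_feedback` are illustrative and do not come from the paper.

```python
import numpy as np

def td_feedback(q, trajectory, gamma=1.0):
    """Single-step TD errors, one per decision, computed from a fixed Q-table."""
    errors = []
    for s, a, r, s_next, done in trajectory:
        # Bootstrap from the next state's best value unless the episode ended.
        bootstrap = 0.0 if done else gamma * np.max(q[s_next])
        errors.append(r + bootstrap - q[s, a])
    return np.array(errors)

def mc_return_feedback(trajectory, gamma=1.0):
    """Monte Carlo returns, the scalar that weights each step's
    REINFORCE-style policy-gradient term."""
    returns = []
    g = 0.0
    for _, _, r, _, _ in reversed(trajectory):
        g = r + gamma * g
        returns.append(g)
    return np.array(returns[::-1])

# Hypothetical acyclic chain: states 0 -> 1 -> 2 -> terminal state 3,
# two actions per state. Tuples are (state, action, reward, next_state, done).
q = np.zeros((4, 2))
traj = [(0, 0, 0.0, 1, False), (1, 1, 0.0, 2, False), (2, 0, 1.0, 3, True)]

# Assumed "sparse change": same decisions, only the final reward differs.
traj_new = traj[:-1] + [(2, 0, 5.0, 3, True)]

# Which decisions receive a different feedback signal after the change?
td_changed = td_feedback(q, traj) != td_feedback(q, traj_new)
mc_changed = mc_return_feedback(traj) != mc_return_feedback(traj_new)

print("TD feedback changed per step:", td_changed.tolist())
print("Return feedback changed per step:", mc_changed.tolist())
```

In this sketch only the last decision's TD error changes, while the return that weights every policy-gradient term changes at all three steps. Over repeated updates the TD change would also propagate backward through bootstrapping, so the example speaks only to the dependence structure of a single update.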
