Training neural networks with discrete stochastic variables presents a unique challenge. Backpropagation is not directly applicable, nor are the reparameterization tricks used in networks with continuous stochastic variables. To address this challenge, we present Hindsight Network Credit Assignment (HNCA), a novel learning algorithm for networks of discrete stochastic units. HNCA works by assigning credit to each unit based on the degree to which its output influences its immediate children in the network. We prove that HNCA produces unbiased gradient estimates with reduced variance compared to the REINFORCE estimator, while the computational cost is similar to that of backpropagation. We first apply HNCA in a contextual bandit setting to optimize a reward function that is unknown to the agent. In this setting, we empirically demonstrate that HNCA significantly outperforms REINFORCE, indicating that the variance reduction implied by our theoretical analysis is significant and impactful. We then show how HNCA can be extended to optimize a more general function of the outputs of a network of stochastic units, where the function is known to the agent. We apply this extended version of HNCA to train a discrete variational auto-encoder and empirically show it compares favourably to other strong methods. We believe that the ideas underlying HNCA can help stimulate new ways of thinking about efficient credit assignment in stochastic compute graphs.
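To make the comparison concrete, below is a minimal sketch of the REINFORCE estimator the abstract uses as its baseline, for a single Bernoulli-sigmoid unit in a contextual bandit. The function and variable names (`reinforce_grad`, `reward_fn`, `theta`, `x`) are illustrative assumptions, not the paper's implementation; HNCA itself would additionally condition on what the unit's children produced, in hindsight, to reduce the variance of this estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reinforce_grad(theta, x, reward_fn, rng):
    """One-sample REINFORCE gradient estimate for a Bernoulli output unit.

    The unit fires (phi = 1) with probability sigmoid(theta . x).  The
    estimator reward * grad log pi(phi | x) is unbiased but typically
    high variance; this is the baseline HNCA is compared against.
    """
    p = sigmoid(theta @ x)            # firing probability of the unit
    phi = rng.random() < p            # sample the discrete output
    reward = reward_fn(int(phi))      # reward function is a black box to the agent
    # For a Bernoulli-sigmoid unit, d/d theta log pi(phi | x) = (phi - p) * x
    return reward * (float(phi) - p) * x

rng = np.random.default_rng(0)
theta = np.zeros(4)
x = rng.normal(size=4)
g = reinforce_grad(theta, x, reward_fn=lambda phi: float(phi), rng=rng)
```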