Training neural networks with discrete stochastic variables presents a unique challenge. Backpropagation is not directly applicable, nor are the reparameterization tricks used in networks with continuous stochastic variables. To address this challenge, we present Hindsight Network Credit Assignment (HNCA), a novel gradient estimation algorithm for networks of discrete stochastic units. HNCA works by assigning credit to each unit based on the degree to which its output influences its immediate children in the network. We prove that HNCA produces unbiased gradient estimates with reduced variance compared to the REINFORCE estimator, while the computational cost is similar to that of backpropagation. We first apply HNCA in a contextual bandit setting to optimize a reward function that is unknown to the agent. In this setting, we empirically demonstrate that HNCA significantly outperforms REINFORCE, indicating that the variance reduction implied by our theoretical analysis is substantial in practice. We then show how HNCA can be extended to optimize a more general function of the outputs of a network of stochastic units, where the function is known to the agent. We apply this extended version of HNCA to train a discrete variational auto-encoder and empirically show that it compares favourably to other strong methods. We believe that the ideas underlying HNCA can help stimulate new ways of thinking about efficient credit assignment in stochastic compute graphs.
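To make the contrast with REINFORCE concrete, the following is a minimal sketch of the two estimators for a single hidden Bernoulli unit feeding one Bernoulli output unit in a contextual bandit. The network shape, the parameter names `w` and `v`, and the 0/1 reward are illustrative assumptions, not the paper's experimental setup; the HNCA-style line implements the hindsight idea described above, replacing the unit's REINFORCE term with its conditional expectation given the realized output of the unit's child.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network (hypothetical sizes): one hidden Bernoulli unit
# b ~ Bernoulli(sigmoid(w @ x)) feeding one output Bernoulli unit
# a ~ Bernoulli(sigmoid(v[0] + v[1] * b)).
d = 4
w = rng.normal(size=d)
v = rng.normal(size=2)

x = rng.normal(size=d)
p = sigmoid(w @ x)              # P(b = 1 | x)
b = rng.random() < p            # hidden unit's sampled output
q = sigmoid(v[0] + v[1] * b)    # P(a = 1 | b), the child's policy
a = rng.random() < q            # network output / bandit action
R = 1.0 if a else 0.0           # stand-in reward from the environment

# REINFORCE estimate of dE[R]/dw:  R * d log pi(b | x) / dw.
g_reinforce = R * ((1.0 if b else 0.0) - p) * x

# HNCA-style estimate: marginalize the unit's own sample b in hindsight,
# weighting each value of b by the likelihood the child assigns to its
# realized output a under that value.
def child_lik(b_val):
    qa = sigmoid(v[0] + v[1] * b_val)
    return qa if a else 1.0 - qa

L1, L0 = child_lik(1.0), child_lik(0.0)
g_hnca = R * p * (1.0 - p) * (L1 - L0) / (p * L1 + (1.0 - p) * L0) * x
```

Both estimates are unbiased for the same gradient. Because the HNCA-style estimate is a conditional expectation of the REINFORCE estimate (a Rao-Blackwellization over the unit's own sample), its variance can only be lower or equal, which is the effect the abstract's theoretical claim describes.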