Generative flow networks (GFlowNets) are a family of algorithms for training a sequential sampler of discrete objects under an unnormalized target density and have been successfully used for various probabilistic modeling tasks. Existing training objectives for GFlowNets are either local to states or transitions, or propagate a reward signal over an entire sampling trajectory. We argue that these alternatives represent opposite ends of a gradient bias-variance tradeoff and propose a way to exploit this tradeoff to mitigate its harmful effects. Inspired by the TD($\lambda$) algorithm in reinforcement learning, we introduce subtrajectory balance or SubTB($\lambda$), a GFlowNet training objective that can learn from partial action subsequences of varying lengths. We show that SubTB($\lambda$) accelerates sampler convergence in previously studied and new environments and enables training GFlowNets in environments with longer action sequences and sparser reward landscapes than what was possible before. We also perform a comparative analysis of stochastic gradient dynamics, shedding light on the bias-variance tradeoff in GFlowNet training and the advantages of subtrajectory balance.
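For concreteness, here is a minimal sketch of the objective, assuming the standard GFlowNet quantities (a learned state flow $F$, forward policy $P_F$, and backward policy $P_B$), none of which are defined in this abstract. For a complete trajectory $\tau = (s_0 \to s_1 \to \cdots \to s_n)$, every subtrajectory $\tau_{i:j}$ contributes a squared log-ratio balance term, and SubTB($\lambda$) averages these terms with geometric weights $\lambda^{j-i}$:
\[
\mathcal{L}_{\mathrm{SubTB}}(\tau_{i:j}) = \left( \log \frac{F(s_i) \prod_{k=i}^{j-1} P_F(s_{k+1} \mid s_k)}{F(s_j) \prod_{k=i}^{j-1} P_B(s_k \mid s_{k+1})} \right)^{\!2},
\qquad
\mathcal{L}_{\mathrm{SubTB}(\lambda)}(\tau) = \frac{\sum_{0 \le i < j \le n} \lambda^{j-i} \, \mathcal{L}_{\mathrm{SubTB}}(\tau_{i:j})}{\sum_{0 \le i < j \le n} \lambda^{j-i}}.
\]
Intuitively, $\lambda \to 0$ concentrates weight on single transitions (the local, low-variance but biased end of the tradeoff), while $\lambda \to \infty$ concentrates weight on the full trajectory (the unbiased but high-variance end); intermediate values of $\lambda$ interpolate between the two, in analogy with TD($\lambda$).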