This paper addresses the problem of learning a stochastic policy that generates an object (like a molecular graph) through a sequence of actions, such that the probability of generating an object is proportional to a given positive reward for that object. Whereas standard return maximization tends to converge to a single return-maximizing sequence, there are cases where we would like to sample a diverse set of high-return solutions. These arise, for example, in black-box function optimization when only a few rounds are possible, each with a large batch of queries, and the batches should be diverse, e.g., in the design of new molecules. One can also see this as a problem of approximately converting an energy function to a generative distribution. While MCMC methods can achieve that, they are expensive and generally only perform local exploration. Instead, training a generative policy amortizes the cost of search during training and yields fast generation. Using insights from Temporal Difference learning, we propose GFlowNet, based on a view of the generative process as a flow network, making it possible to handle the tricky case where different trajectories can yield the same final state, e.g., there are many ways to sequentially add atoms to generate some molecular graph. We cast the set of trajectories as a flow and convert the flow consistency equations into a learning objective, akin to the casting of the Bellman equations into Temporal Difference methods. We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution, and demonstrate the improved performance and diversity of GFlowNet on a simple domain where there are many modes to the reward function, and on a molecule synthesis task.
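To make the flow-network view concrete, the following minimal Python sketch builds a flow by hand on a tiny graph and checks that the induced policy samples terminal states in proportion to their reward. It is an illustration only and not the training procedure proposed in the paper: the toy graph, the reward values, the uniform split of incoming flow among parents, and the helper names `build_flow` and `terminal_distribution` are all assumptions made for this example. GFlowNet instead learns the flow by minimizing an objective derived from the same consistency equations.

```python
from collections import defaultdict

# Hypothetical toy DAG: build a two-letter object by appending "a" or "b".
# The state "ab" can be reached by two different trajectories (via "a" or via "b").
children = {
    "": ["a", "b"],
    "a": ["aa", "ab"],
    "b": ["ab", "bb"],
}
# Made-up positive rewards on the terminal states.
reward = {"aa": 1.0, "ab": 4.0, "bb": 3.0}


def build_flow(children, reward):
    """Construct one flow satisfying flow consistency (in-flow == out-flow)."""
    parents = defaultdict(list)
    for s, kids in children.items():
        for k in kids:
            parents[k].append(s)
    # Reverse topological order; in this toy graph, longer names are deeper.
    order = sorted(set(children) | set(reward), key=len, reverse=True)
    state_flow, edge_flow = {}, {}
    for s in order:
        if s in reward:
            # Terminal state: the flow leaving the network here is the reward.
            state_flow[s] = reward[s]
        else:
            # Interior state: total flow is the sum over outgoing edges.
            state_flow[s] = sum(edge_flow[(s, k)] for k in children[s])
        # Assumption: split the incoming flow uniformly among parents.
        for p in parents[s]:
            edge_flow[(p, s)] = state_flow[s] / len(parents[s])
    return state_flow, edge_flow


def terminal_distribution(children, state_flow, edge_flow, source=""):
    """Exact terminal probabilities under P(s'|s) = F(s -> s') / F(s)."""
    mass = defaultdict(float)
    mass[source] = 1.0
    for s in sorted(children, key=len):  # topological order
        for k in children[s]:
            mass[k] += mass[s] * edge_flow[(s, k)] / state_flow[s]
    return {x: p for x, p in mass.items() if x not in children}


state_flow, edge_flow = build_flow(children, reward)
probs = terminal_distribution(children, state_flow, edge_flow)
total = state_flow[""]  # flow at the source equals the sum of rewards
for x in sorted(probs):
    print(x, probs[x], reward[x] / total)
```

On this toy graph the flow at the source is 8, the sum of the rewards, and the terminal probabilities are 1/8, 4/8, and 3/8 for `aa`, `ab`, and `bb`. The state `ab` receives its probability from two trajectories, which is the case the flow formulation is meant to handle.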