As a new way of training generative models, the Generative Adversarial Net (GAN), which uses a discriminative model to guide the training of the generative model, has enjoyed considerable success in generating real-valued data. However, it has limitations when the goal is to generate sequences of discrete tokens. A major reason is that the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. Also, the discriminative model can only assess a complete sequence, while for a partially generated sequence it is non-trivial to balance its current score against the score it will receive once the entire sequence has been generated. In this paper, we propose a sequence generation framework, called SeqGAN, to address these problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by performing policy gradient updates directly. The RL reward signal comes from the GAN discriminator's judgment on a complete sequence and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines.
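To make the training signal described above concrete, the following is a minimal sketch of the generator-update step: the generator is a stochastic policy sampled token by token, the discriminator scores only complete sequences, partial sequences get a reward via Monte Carlo rollouts, and the generator is updated with a REINFORCE-style policy gradient. This is not the authors' implementation; the PyTorch modules, vocabulary size, sequence length, and rollout count are illustrative assumptions.

```python
# Illustrative sketch only: toy hyperparameters, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, EMB, HID, N_ROLLOUTS = 20, 8, 32, 64, 16

class Generator(nn.Module):
    """Stochastic policy: p(token_t | tokens_<t), parameterised by a GRU."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def step(self, token, h):
        x = self.emb(token).unsqueeze(1)              # (B, 1, EMB)
        o, h = self.gru(x, h)
        return F.log_softmax(self.out(o.squeeze(1)), dim=-1), h

    def sample(self, batch, prefix=None):
        """Sample complete sequences, optionally continuing a given prefix (rollout)."""
        h = torch.zeros(1, batch, HID)
        tok = torch.zeros(batch, dtype=torch.long)    # start token = 0
        seq, logps = [], []
        if prefix is not None:                        # replay the prefix to build the state
            for t in range(prefix.size(1)):
                _, h = self.step(tok, h)
                tok = prefix[:, t]
                seq.append(tok)
        while len(seq) < SEQ_LEN:                     # sample the remaining tokens
            logp, h = self.step(tok, h)
            tok = torch.multinomial(logp.exp(), 1).squeeze(1)
            seq.append(tok)
            logps.append(logp.gather(1, tok.unsqueeze(1)).squeeze(1))
        logps = torch.stack(logps, dim=1) if logps else None
        return torch.stack(seq, dim=1), logps

class Discriminator(nn.Module):
    """Scores only complete sequences; its 'real' probability serves as the RL reward."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, 1)

    def forward(self, seq):
        _, h = self.gru(self.emb(seq))
        return torch.sigmoid(self.out(h.squeeze(0))).squeeze(1)

def rollout_rewards(gen, disc, seq):
    """Monte Carlo search: complete each prefix N_ROLLOUTS times and average the
    discriminator's score, yielding a reward for every intermediate step."""
    rewards = []
    for t in range(1, SEQ_LEN + 1):
        if t == SEQ_LEN:
            r = disc(seq)                             # complete sequence: score directly
        else:
            r = torch.stack([disc(gen.sample(seq.size(0), prefix=seq[:, :t])[0])
                             for _ in range(N_ROLLOUTS)]).mean(0)
        rewards.append(r)
    return torch.stack(rewards, dim=1)                # (B, SEQ_LEN)

gen, disc = Generator(), Discriminator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

seq, logps = gen.sample(batch=4)
with torch.no_grad():
    rewards = rollout_rewards(gen, disc, seq)
loss = -(logps * rewards).sum(dim=1).mean()           # REINFORCE-style policy gradient
opt.zero_grad(); loss.backward(); opt.step()
```

Because the reward is attached to sampled log-probabilities rather than backpropagated through discrete tokens, the gradient flows only through the policy, which is exactly how the framework sidesteps the generator differentiation problem.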