For many new application domains of data-to-text generation, the main obstacle to training neural models is the lack of training data. While large numbers of instances are usually available on the data side, often only very few text samples are available. To address this problem, we propose a novel few-shot approach for this setting. Our approach automatically augments the data available for training by (i) generating new text samples in which specific values are replaced with alternative values from the same category, (ii) generating new text samples with GPT-2, and (iii) automatically pairing the new text samples with data samples. Since the text augmentation can introduce noise into the training data, we use cycle consistency as an objective, to ensure that a given data sample can be correctly reconstructed after it has been formulated as text (and that text samples can be reconstructed from data). On both the E2E and WebNLG benchmarks, we show that this weakly supervised training paradigm outperforms fully supervised seq2seq models with less than 10% of the annotations. Using all annotated data, our model improves over a standard seq2seq model by more than 5 BLEU points, establishing a new state of the art on both datasets.
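To make augmentation step (i) concrete, below is a minimal sketch of value-replacement augmentation, assuming a flat key-value meaning representation as in E2E. The slot names, the value inventory, and the augment function are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical pools of alternative values per slot category (illustrative only).
CATEGORY_VALUES = {
    "name": ["The Eagle", "Blue Spice", "The Punter"],
    "food": ["Italian", "Japanese", "French"],
    "area": ["riverside", "city centre"],
}

def augment(data, text, n_samples=3, seed=0):
    """Create new (data, text) pairs by swapping slot values with
    alternatives from the same category, in both the data and the text."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_samples):
        new_data, new_text = dict(data), text
        for slot, value in data.items():
            candidates = [v for v in CATEGORY_VALUES.get(slot, []) if v != value]
            if not candidates:
                continue
            replacement = rng.choice(candidates)
            new_data[slot] = replacement
            # Verbatim string match; real data may require delexicalization
            # to locate the value reliably in the text.
            new_text = new_text.replace(value, replacement)
        augmented.append((new_data, new_text))
    return augmented

pairs = augment(
    {"name": "The Eagle", "food": "Italian", "area": "riverside"},
    "The Eagle is an Italian restaurant by the riverside.",
)
```

The GPT-2-generated texts from step (ii) would enter training only after the automatic pairing of step (iii), which is why the noise-robust cycle-consistency objective is needed.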
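The cycle-consistency objective can be sketched as a reconstruction loss in both directions, assuming a data-to-text model p_theta(t | d) and an auxiliary text-to-data model p_phi(d | t); the paper's exact formulation may differ.

```latex
\mathcal{L}_{\mathrm{cyc}}
  = \mathbb{E}_{d}\!\left[-\log p_{\phi}\!\left(d \mid \hat{t}\right)\right]
  + \mathbb{E}_{t}\!\left[-\log p_{\theta}\!\left(t \mid \hat{d}\right)\right],
\qquad
\hat{t} \sim p_{\theta}(\cdot \mid d), \quad
\hat{d} \sim p_{\phi}(\cdot \mid t).
```

Minimizing this loss penalizes generated texts from which the original data cannot be recovered (and vice versa), which discourages the model from copying noise introduced by the automatic augmentation and pairing steps.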