Data-to-text generation systems aim to generate text descriptions based on input data (often represented in tabular form). A typical system relies on a large number of training samples to learn the correspondence between tables and texts. However, large training sets are expensive to obtain, which limits the applicability of these approaches in real-world scenarios. In this work, we focus on few-shot data-to-text generation. We observe that, while fine-tuned pretrained language models may generate plausible sentences, they suffer from a low-semantic-coverage problem in the few-shot setting: important input slots tend to be missing from the generated text. To address this, we propose a search-and-learning approach that leverages pretrained language models but inserts the missing slots to improve semantic coverage. We further fine-tune our system on the search results to smooth out the search noise, yielding better-quality text and substantially improving inference efficiency. Experiments show that our model achieves strong performance on the E2E and WikiBio datasets. In particular, we cover 98.35% of input slots on E2E, largely alleviating the low-coverage problem.
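As a rough illustration of the slot-insertion idea, the sketch below checks which input slots are missing from a generated sentence and greedily inserts each one at the position a scoring function prefers. This is a minimal sketch under strong assumptions, not the paper's actual method: coverage is approximated by case-insensitive string matching, and `score_fluency` is a dummy stand-in for a pretrained language-model score; all names here (`missing_slots`, `candidate_insertions`, etc.) are hypothetical.

```python
# Minimal sketch of slot-insertion search for semantic coverage.
# All names are illustrative; this is not the paper's implementation.

from typing import Dict, List, Tuple


def missing_slots(slots: Dict[str, str], text: str) -> List[Tuple[str, str]]:
    """Return the (key, value) pairs whose value is absent from the text."""
    return [(k, v) for k, v in slots.items() if v.lower() not in text.lower()]


def candidate_insertions(text: str, value: str) -> List[str]:
    """Enumerate texts with the slot value inserted at every word boundary."""
    words = text.split()
    return [" ".join(words[:i] + [value] + words[i:])
            for i in range(len(words) + 1)]


def score_fluency(text: str) -> float:
    """Placeholder for a pretrained-LM fluency score (e.g., negative
    perplexity). A real system would query a language model here."""
    return 0.0  # dummy score: all candidates tie, so the first is picked


def insert_missing_slots(slots: Dict[str, str], text: str) -> str:
    """Greedy search: insert each missing slot at its best-scoring position."""
    for _, value in missing_slots(slots, text):
        text = max(candidate_insertions(text, value), key=score_fluency)
    return text


if __name__ == "__main__":
    record = {"name": "The Eagle", "food": "French", "area": "riverside"}
    draft = "The Eagle serves French food."  # 'riverside' is missing
    # With a real LM score, the insertion would land at a fluent position.
    print(insert_missing_slots(record, draft))
```

With a genuine language-model scorer in place of the dummy, the greedy search would select the insertion point that keeps the sentence fluent, which is the role the pretrained model plays in the search step; fine-tuning on the search outputs then removes the need for this search at inference time.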