Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels, such as column headings, keys, or relation names, to generalize to out-of-domain examples. However, these models are known to produce semantically inaccurate outputs when such labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on the task of describing a relation between two entities. For our experiments, we collect a novel dataset for verbalizing a diverse set of 1,522 unique relations from three large-scale knowledge graphs (Wikidata, DBPedia, YAGO). We find that while PLMs for D2T generation expectedly fail on unclear cases, models trained with a large variety of relation labels are surprisingly robust at verbalizing novel, unseen relations. We argue that using data with a diverse set of clear and meaningful labels is key to training D2T generation systems capable of generalizing to novel domains.