VANiLLa: Verbalized Answers in Natural Language at Large Scale

In recent years, there have been significant developments in Question Answering over Knowledge Graphs (KGQA). Despite these advancements, current KGQA datasets provide answers only as the direct output of the formal query, rather than as full sentences that incorporate the question context. To produce coherent answer sentences that reuse the question's vocabulary, template-based verbalizations are usually employed, which in turn require extensive expert intervention. This makes way for machine learning approaches; however, there is a scarcity of datasets that empower machine learning models in this area. Hence, we provide the VANiLLa dataset, which aims to reduce this gap by offering answers as natural language sentences. The answer sentences in this dataset are syntactically and semantically closer to the question than to the triple fact. Our dataset consists of over 100k simple questions adapted from the CSQA and SimpleQuestionsWikidata datasets and generated using a semi-automatic framework. We also present results of training multiple baseline models, adapted from current state-of-the-art Natural Language Generation (NLG) architectures, on our dataset. We believe that this dataset will allow researchers to focus on finding suitable methodologies and architectures for answer verbalization.

