Building a question-answering agent currently requires large annotated datasets, which are prohibitively expensive to collect. This paper proposes Schema2QA, an open-source toolkit that can generate a Q&A system from a database schema augmented with a few annotations for each field. The key concept is to cover the space of possible compound queries on the database with a large number of in-domain questions, synthesized with the help of a corpus of generic query templates. The synthesized data and a small paraphrase set are used to train a novel neural network based on the BERT pretrained model. We use Schema2QA to generate Q&A systems for five Schema.org domains (restaurants, people, movies, books, and music) and obtain an overall accuracy between 64% and 75% on crowdsourced questions for these domains. Once annotations and paraphrases are obtained for a Schema.org schema, no additional manual effort is needed to create a Q&A agent for any website that uses the same schema. Furthermore, we demonstrate that learning can be transferred from the restaurant domain to the hotel domain, obtaining 64% accuracy on crowdsourced questions with no manual effort. Schema2QA achieves an accuracy of 60% on popular restaurant questions that can be answered using Schema.org. Its performance is comparable to that of Google Assistant, 7% lower than Siri, and 15% higher than Alexa. It outperforms all of these assistants by at least 18% on more complex, long-tail questions.
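To make the synthesis idea concrete, the following is a minimal sketch of template-based question generation from an annotated schema. It is illustrative only and is not the Schema2QA implementation: the `SCHEMA` dictionary layout, the `TEMPLATES` list, the helper names, and the `table where field == "value"` query notation are all assumptions made for this example. Only the general idea, combining per-field annotations with generic templates to cover compound queries, comes from the abstract.

```python
from itertools import combinations, product

# Hypothetical annotated schema: each field carries a few natural-language
# phrases (the "annotations") and some sample values.
SCHEMA = {
    "table": "restaurant",
    "fields": {
        "servesCuisine": {
            "phrases": ["that serve {value} food"],
            "values": ["Italian", "Thai"],
        },
        "addressLocality": {
            "phrases": ["in {value}"],
            "values": ["Palo Alto"],
        },
    },
}

# Hypothetical generic templates, independent of any one domain.
TEMPLATES = ["show me {table}s {filters}", "find {table}s {filters}"]


def field_filters(name, annotation):
    """Yield (phrase, condition) pairs for a single annotated field."""
    for phrase, value in product(annotation["phrases"], annotation["values"]):
        yield phrase.format(value=value), f'{name} == "{value}"'


def synthesize(schema, templates, max_filters=2):
    """Return (question, query) pairs covering filters on up to max_filters fields."""
    examples = []
    fields = sorted(schema["fields"].items())
    for k in range(1, max_filters + 1):
        for combo in combinations(fields, k):
            per_field = [list(field_filters(n, a)) for n, a in combo]
            # One choice picks a single (phrase, condition) pair per field.
            for choice in product(*per_field):
                phrases = " ".join(p for p, _ in choice)
                conditions = " and ".join(c for _, c in choice)
                query = f'{schema["table"]} where {conditions}'
                for template in templates:
                    question = template.format(
                        table=schema["table"], filters=phrases
                    )
                    examples.append((question, query))
    return examples


if __name__ == "__main__":
    for question, query in synthesize(SCHEMA, TEMPLATES):
        print(f"{question}  =>  {query}")
```

Even this toy version shows why the number of synthesized questions grows quickly: every added field, phrase, value, or template multiplies the number of question and query pairs available as training data.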