We introduce MilkQA, a question answering dataset from the dairy domain dedicated to the study of consumer questions. The dataset contains 2,657 question-answer pairs written in Portuguese and originally collected by the Brazilian Agricultural Research Corporation (Embrapa). All questions were motivated by real situations and written by thousands of authors with very different backgrounds and levels of literacy, while the answers were written by specialists from Embrapa's customer service. The dataset was filtered and anonymized by three human annotators. Consumer questions are a challenging kind of question, usually posed as a way of seeking information. Although several question answering datasets are available, most such resources are not suitable for research on answer selection models for consumer questions. We aim to fill this gap by making MilkQA publicly available. We study the behavior of four answer selection models on MilkQA: two baseline models and two convolutional neural network architectures. Our results show that MilkQA poses real challenges to computational models, particularly because of the linguistic characteristics of its questions and their unusual length. Only one of the models evaluated gives reasonable results, at the cost of high computational requirements.