The task of Question Answering has gained prominence over the past few decades as a means of testing a machine's ability to understand natural language. Large Machine Reading datasets have driven the development of neural models that demand deeper language understanding than traditional information retrieval tasks. Different components of these neural architectures are designed to address different challenges. As a first step toward generalization across multiple domains, we attempt to understand and compare the peculiarities of existing end-to-end neural models on the Stanford Question Answering Dataset (SQuAD) through quantitative and qualitative analysis of the results attained by each of them. We observe that prediction errors reflect certain model-specific biases, which we discuss further in this paper.