Current textual question answering models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns in the data, so they fail to generalize to out-of-distribution settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and align the question to a subgraph of the context in order to find the answer. We formulate our model as a structured SVM, with alignment scores computed via BERT, and we can train end-to-end despite using beam search for approximate inference. Our explicit use of alignments allows us to explore a set of constraints with which we can prohibit certain types of bad model behavior arising in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input lead the model to choose the answer without relying on post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets. The results show that our model is more robust cross-domain than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.
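To make the training objective concrete, here is a minimal sketch of a standard structured SVM (max-margin) loss consistent with the setup described above; the notation ($s_\theta$, $\Delta$, $a^*$, $\mathcal{A}$) is illustrative rather than taken from the paper:

\[
\mathcal{L}(\theta) \;=\; \max_{a \in \mathcal{A}} \Big[\, s_\theta(q, c, a) + \Delta(a, a^*) \,\Big] \;-\; s_\theta(q, c, a^*),
\qquad
s_\theta(q, c, a) \;=\; \sum_{(u, v) \in a} \mathrm{score}_{\mathrm{BERT}}(u, v),
\]

where $\mathcal{A}$ is the set of alignments from question units to context subgraphs, $a^*$ is the gold alignment, and $\Delta$ is a Hamming-style cost. The loss-augmented max is approximated with beam search; because the loss is simply a difference of two scores, gradients still flow through $s_\theta$, which is why approximate inference does not prevent end-to-end training.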