Question answering (QA) models have been shown to be insensitive to large perturbations of their inputs; that is, they make correct and confident predictions even on heavily perturbed inputs from which humans cannot derive the correct answers. In addition, QA models fail to generalize to other domains and adversarial test sets, while humans maintain high accuracy. Based on these observations, we hypothesize that QA models do not use the intended features required for human reading but instead rely on spurious features, causing their lack of generalization ability. We therefore attempt to answer the question: if the overconfident predictions of QA models on various types of perturbations are penalized, will out-of-distribution (OOD) generalization improve? To prevent models from making confident predictions on perturbed inputs, we first follow existing studies and maximize the entropy of the output probability for perturbed inputs. However, we find that QA models trained to be sensitive to one perturbation type are often insensitive to unseen types of perturbations. Thus, we simultaneously maximize the entropy for four perturbation types (i.e., word- and sentence-level shuffling and deletion) to further close the gap between models and humans. Contrary to our expectations, although the models become sensitive to all four types of perturbations, we find that OOD generalization is not improved. Moreover, OOD generalization is sometimes degraded after entropy maximization. Making unconfident predictions on heavily perturbed inputs may in itself help gain human trust. However, our negative results suggest that researchers should pay attention to the side effects of entropy maximization.
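The training objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`entropy`, `combined_loss`, `shuffle_words`) and the weighting scheme with a hypothetical coefficient `lam` are assumptions; the sketch only shows how a standard loss on clean inputs can be combined with an entropy-maximization penalty on a perturbed copy of the input.

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a probability distribution (in nats)."""
    return -np.sum(p * np.log(p + 1e-12))

def shuffle_words(tokens, rng):
    """Word-level shuffling: one of the four perturbation types
    (word/sentence shuffling and deletion) mentioned in the abstract."""
    perm = rng.permutation(len(tokens))
    return [tokens[i] for i in perm]

def combined_loss(clean_ce, perturbed_logits, lam=0.1):
    """Standard cross-entropy on the clean input, minus lam times the
    entropy of the prediction on the perturbed input. Minimizing this
    pushes the model toward *high*-entropy (unconfident) outputs on
    perturbed inputs. `lam` is a hypothetical weighting coefficient."""
    p = softmax(perturbed_logits)
    return clean_ce - lam * entropy(p)
```

For example, a uniform (maximally unconfident) prediction on the perturbed input yields a lower combined loss than a peaked (confident) one, which is exactly the behavior the entropy penalty rewards.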