While large pre-trained language models are powerful, their predictions often lack logical consistency across test inputs. For example, a state-of-the-art Macaw question-answering (QA) model answers 'Yes' to 'Is a sparrow a bird?' and 'Does a bird have feet?' but answers 'No' to 'Does a sparrow have feet?'. To address this failure mode, we propose a framework, Consistency Correction through Relation Detection, or ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models using pre-trained natural language inference (NLI) models without fine-tuning or re-training. Given a batch of test inputs, ConCoRD samples several candidate outputs for each input and instantiates a factor graph that accounts for both the model's belief about the likelihood of each answer choice in isolation and the NLI model's beliefs about pair-wise answer choice compatibility. We show that a weighted MaxSAT solver can efficiently compute high-quality answer choices under this factor graph, improving over the raw model's predictions. Our experiments demonstrate that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models using off-the-shelf NLI models, notably increasing accuracy of LXMERT on ConVQA by 5% absolute. See https://ericmitchell.ai/emnlp-2022-concord/ for code and data.
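To make the decoding step concrete, the following is a minimal sketch of the idea: each question contributes a factor for the model's belief over its candidate answers, each NLI judgment contributes a soft pairwise compatibility constraint, and the best joint assignment maximizes the total weighted score. The probabilities, the entailment pair, and the weight below are illustrative assumptions (not values from the paper), and brute-force enumeration stands in for a real weighted MaxSAT solver.

```python
import itertools
import math

# Hypothetical QA-model beliefs per question (probabilities are invented
# for illustration). The raw model wrongly prefers "No" on the last one.
beliefs = {
    "Is a sparrow a bird?": {"Yes": 0.9, "No": 0.1},
    "Does a bird have feet?": {"Yes": 0.8, "No": 0.2},
    "Does a sparrow have feet?": {"Yes": 0.4, "No": 0.6},
}

# Hypothetical NLI judgment: answering "Yes" to the premise question is
# taken to entail answering "Yes" to the hypothesis question, with weight 2.0.
entailments = [
    ((("Does a bird have feet?", "Yes"),
      ("Does a sparrow have feet?", "Yes")), 2.0),
]

def concord_decode(beliefs, entailments):
    """Pick one answer per question maximizing log-likelihood under the
    model's beliefs minus penalties for violated NLI soft clauses."""
    questions = list(beliefs)
    best, best_score = None, -math.inf
    for combo in itertools.product(*(beliefs[q].items() for q in questions)):
        assignment = {q: a for q, (a, _) in zip(questions, combo)}
        score = sum(math.log(p) for _, p in combo)
        for ((pq, pa), (hq, ha)), w in entailments:
            # Soft clause (NOT premise) OR hypothesis: penalize assignments
            # where the premise answer is chosen but the entailed one is not.
            if assignment[pq] == pa and assignment[hq] != ha:
                score -= w
        if score > best_score:
            best, best_score = assignment, score
    return best

answers = concord_decode(beliefs, entailments)
print(answers["Does a sparrow have feet?"])  # entailment flips it to "Yes"
```

With these illustrative weights, keeping the raw "No" answer costs the 2.0 entailment penalty, which outweighs the log-probability gap, so the joint assignment flips the sparrow question to "Yes" and restores consistency across the batch.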