Question-answering datasets require a broad set of reasoning skills. We show how to use question decompositions to teach language models these broad reasoning skills in a robust fashion. Specifically, we use widely available QDMR representations to programmatically create synthetic contexts for real questions in six multihop reasoning datasets. These contexts are carefully designed to avoid common reasoning shortcuts prevalent in real contexts that prevent models from learning the right skills. This results in a pretraining dataset, named TeaBReaC, containing 525K multihop questions (with associated formal programs) covering about 900 reasoning patterns. We show that pretraining standard language models (LMs) on TeaBReaC before fine-tuning them on target datasets improves their performance by up to 13 EM points across 3 multihop QA datasets, with a 30 point gain on more complex questions. The resulting models also demonstrate higher robustness, with a 6-11 point improvement on two contrast sets. Furthermore, TeaBReaC pretraining substantially improves model performance and robustness even when starting with numeracy-aware LMs pretrained using recent methods (e.g., PReasM). Our work thus shows how one can effectively use decomposition-guided contexts to robustly teach multihop reasoning.