When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple-choice questions covering a diverse set of science topics, with annotations of their answers in the form of corresponding lectures and explanations. We further design language models that learn to generate lectures and explanations as the chain of thought to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding them into the input; we observe that this improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from less data, achieving the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.
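To make the CoT setting concrete, below is a minimal sketch of how a few-shot prompt for a ScienceQA-style question might be assembled, with the model asked to produce the answer followed by a lecture and explanation as its chain of thought. The field names, the prompt template, and the example data are illustrative assumptions, not the authors' verbatim prompt or dataset schema.

```python
# Minimal sketch: building a few-shot CoT prompt for a ScienceQA-style
# multiple-choice question. The output format (answer first, then lecture
# and explanation as the CoT) follows the paper's description; the exact
# template and field names here are assumptions for illustration.

def format_example(ex, with_cot_target=True):
    """Render one question as a prompt block; optionally append the gold
    answer, lecture, and explanation (the CoT) as the completion target."""
    options = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(ex["choices"]))
    block = (
        f"Question: {ex['question']}\n"
        f"Context: {ex['context']}\n"
        f"Options: {options}\n"
        "Answer:"
    )
    if with_cot_target:
        letter = chr(65 + ex["answer"])
        block += (
            f" The answer is ({letter}).\n"
            f"Lecture: {ex['lecture']}\n"
            f"Explanation: {ex['explanation']}\n"
        )
    return block


def build_prompt(train_examples, test_example):
    """Concatenate in-context demonstrations with the test question; the
    model is expected to continue with the answer followed by its CoT."""
    demos = "\n".join(format_example(ex) for ex in train_examples)
    return demos + "\n" + format_example(test_example, with_cot_target=False)


if __name__ == "__main__":
    demo = {
        "question": "Which of these is a solid?",
        "context": "The photo shows three everyday objects.",  # hypothetical caption text
        "choices": ["water", "an ice cube", "steam"],
        "answer": 1,
        "lecture": "Solid, liquid, and gas are states of matter. A solid keeps its own shape.",
        "explanation": "An ice cube keeps its own shape, so it is a solid.",
    }
    test = {
        "question": "Which of these is a liquid?",
        "context": "",
        "choices": ["milk", "a rock", "oxygen"],
    }
    print(build_prompt([demo], test))
```

In this setup, feeding the explanations in the input instead (the upper-bound condition mentioned above) would simply move the lecture and explanation text before the "Answer:" cue rather than after it.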