Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to correct answers. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40% F1, indicating substantial room for improvement in general mathematical reasoning and understanding.
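To make the program-annotation format concrete, the following is a minimal sketch of what a Python-program solution of the kind LILA collects might look like; the specific word problem, function name, and values are illustrative assumptions, not drawn from the benchmark itself.

```python
# Hypothetical example of a program-based solution in the LILA style:
# the answer is produced by an executable Python program, so the
# intermediate reasoning steps are explainable, not just the final answer.

# Question (illustrative): "A store sells pencils in packs of 12.
# If Maya buys 7 packs and gives away 15 pencils, how many does she keep?"

def solution() -> int:
    pencils_per_pack = 12    # pack size stated in the question
    packs_bought = 7         # number of packs Maya buys
    pencils_given_away = 15  # pencils she gives away
    total = pencils_per_pack * packs_bought
    return total - pencils_given_away

print(solution())  # 69 -- executing the program yields the answer
```

A design benefit of this format is that correctness can be checked by executing the program against the gold answer, while the code itself doubles as a step-by-step rationale.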