In this paper, we introduce FAMMA, an open-source benchmark for \underline{f}in\underline{a}ncial \underline{m}ultilingual \underline{m}ultimodal question \underline{a}nswering (QA). Our benchmark evaluates the ability of large language models (LLMs) to answer complex reasoning questions that require advanced financial knowledge. The benchmark has two versions: FAMMA-Basic consists of 1,945 questions extracted from university textbooks and exams, along with human-annotated answers and rationales; FAMMA-LivePro consists of 103 novel questions created by human domain experts, with answers and rationales withheld from the public for a contamination-free evaluation. These questions cover advanced knowledge of 8 major subfields in finance (e.g., corporate finance, derivatives, and portfolio management). Most questions are in English, with the remainder in Chinese or French. Each question includes non-textual data such as charts, diagrams, or tables. Our experiments reveal that FAMMA poses a significant challenge to LLMs, including reasoning models such as GPT-o1 and DeepSeek-R1. Additionally, we curated 1,270 reasoning trajectories of DeepSeek-R1 on the FAMMA-Basic data and fine-tuned a series of open-source Qwen models on this reasoning data. We found that training a model on these reasoning trajectories significantly improves its performance on FAMMA-LivePro. We release our leaderboard, data, code, and trained models at https://famma-bench.github.io/famma/.