Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, Chao Cao, Hanqi Jiang, Hanxu Chen, Yiwei Li, Junhao Chen, Huawen Hu, Yihen Liu, Huaqin Zhao, Shaochen Xu, Haixing Dai, Lin Zhao, Ruidong Zhang, Wei Zhao, Zhenyuan Yang, Jingyuan Chen, Peilong Wang, Wei Ruan, Hui Wang, Huan Zhao, Jing Zhang, Yiming Ren, Shihuan Qin, Tong Chen, Jiaxi Li, Arif Hassan Zidan, Afrar Jahin, Minheng Chen, Sichen Xia, Jason Holmes, Yan Zhuang, Jiaqi Wang, Bochen Xu, Weiran Xia, Jichao Yu, Kaibo Tang, Yaxuan Yang, Bolun Sun, Tao Yang, Guoyu Lu, Xianqiao Wang, Lilong Chai, He Li, Jin Lu, Lichao Sun, Xin Zhang, Bao Ge, Xintao Hu, Lian Zhang, Hua Zhou, Lu Zhang, Shu Zhang, Ninghao Liu, Bei Jiang, Linglong Kong, Zhen Xiang, Yudan Ren, Jun Liu, Xi Jiang, Yu Bao, Wei Zhang, Xiang Li, Gang Li, Wei Liu, Dinggang Shen, Andrea Sikora, Xiaoming Zhai, Dajiang Zhu, Tianming Liu

This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include:

- An 83.3% success rate in solving complex competitive programming problems, surpassing many human experts.
- Superior ability in generating coherent and accurate radiology reports, outperforming the other evaluated models.
- 100% accuracy on high school-level mathematical reasoning tasks, with detailed step-by-step solutions.
- Advanced natural language inference capabilities across general and specialized domains such as medicine.
- Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis.
- Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields.
- Strong capabilities in quantitative investing, supported by comprehensive financial knowledge and statistical modeling skills.
- Effective performance in social media analysis, including sentiment analysis and emotion recognition.

The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
