Machine intelligence embodies the long-standing dream of making machines as intelligent as human beings. While recent progress in Large Language Models (LLMs) shows substantial task-specific skill across a wide array of downstream tasks, LLMs still more or less fall short of general intelligence. Motivated by the correlation between intelligence and System 2 reasoning (slow thinking), in this paper we aim to answer a worthwhile research question: could machine intelligence such as LLMs be evolved to acquire reasoning ability (rather than specific skills), just like human beings? To this end, we propose the Evolutionary Reasoning Optimization (ERO) framework, which performs survival-of-the-fittest selection over a population of LLMs to search for an individual with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population, after which an evolutionary strategy evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we report two surprising empirical findings: i) even the latest LLMs such as GPT-5 show limited System 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful reasoning ability. Our project can be accessed at https://github.com/MetaEvo/ERO for reproduction.
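The survival-of-the-fittest loop described above can be sketched generically. This is a minimal illustration, not the paper's implementation: the individuals below are scalar parameters standing in for LLM instances, `toy_score` is a hypothetical surrogate for the quantified reasoning score, and the (mu+lambda)-style selection is one common evolutionary-strategy choice among several ERO could use.

```python
import random

def evolve(init_pop, score, mutate, generations=100, rng=None):
    """Generic survival-of-the-fittest loop (mu+lambda style).

    init_pop : list of candidate individuals (here scalars standing in
               for LLM instances -- a hypothetical simplification).
    score    : fitness function; stands in for the paper's quantified
               reasoning score on a given task.
    mutate   : produces a perturbed offspring from a parent.
    """
    rng = rng or random.Random(0)
    pop = list(init_pop)
    mu = len(pop)
    for _ in range(generations):
        # Each survivor produces one offspring (lambda = mu).
        offspring = [mutate(p, rng) for p in pop]
        # Keep the best mu of parents + offspring (elitist selection).
        pop = sorted(pop + offspring, key=score, reverse=True)[:mu]
    return pop[0]  # best individual found

# Toy surrogate objective: the "reasoning score" peaks at x = 3.
toy_score = lambda x: -(x - 3.0) ** 2
toy_mutate = lambda x, rng: x + rng.gauss(0.0, 0.5)

best = evolve([0.0, 10.0, -5.0, 7.0], toy_score, toy_mutate)
```

In the actual framework, `mutate` would correspond to perturbing an LLM (e.g. its prompts or weights) and `score` to evaluating it on the reasoning task; the loop structure is the same.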