AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

Self-Taught Reasoners (STaR), also known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in an imbalance in which observations are trained on: the model is inefficiently over-trained on already-solved examples and under-trained on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves the best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.
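
The imbalance described above can be illustrated with a small toy simulation. This is a minimal sketch under stated assumptions, not the AdaSTaR algorithm itself: each observation is given a fixed, hypothetical probability of being solved (`solve_prob`), the model does not improve over time, and the "balanced" rule (sample the least-trained observations first) is a stand-in chosen only to show the effect of diversity-aware sampling. The curriculum principle is not modeled. All names (`simulate`, `spread`, `solve_prob`) are illustrative.

```python
import random


def simulate(strategy, solve_prob, n_iters=200, batch=5, seed=0):
    """Return how many times each observation ends up being trained on.

    An observation is "trained on" only when it is sampled AND the
    (simulated) model solves it, mimicking the rejection-sampling filter.
    """
    rng = random.Random(seed)
    n = len(solve_prob)
    trained = [0] * n
    for _ in range(n_iters):
        if strategy == "random":
            # Uniform random sampling without replacement.
            picked = rng.sample(range(n), batch)
        elif strategy == "balanced":
            # Hypothetical diversity rule (an assumption, not the paper's):
            # pick the least-trained observations, ties broken at random.
            order = sorted(range(n), key=lambda i: (trained[i], rng.random()))
            picked = order[:batch]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        for i in picked:
            # Rejection step: keep the sample only if the answer is correct.
            if rng.random() < solve_prob[i]:
                trained[i] += 1
    return trained


def spread(counts):
    """Gap between the most- and least-trained observation."""
    return max(counts) - min(counts)


# Hypothetical difficulty profile: observation 0 is hardest, 19 is easiest.
solve_prob = [0.1 + 0.8 * i / 19 for i in range(20)]

random_counts = simulate("random", solve_prob)
balanced_counts = simulate("balanced", solve_prob)

print("random  :", random_counts, "spread =", spread(random_counts))
print("balanced:", balanced_counts, "spread =", spread(balanced_counts))
```

Under random sampling, easy observations pass the rejection filter far more often than hard ones, so their training counts diverge. The balanced rule keeps re-sampling the hard observations until they are solved, which narrows the gap at the cost of spending more samples on them.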
