Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison

We introduce \textit{Feedback Descent}, a framework that optimizes text artifacts -- prompts, code, and molecules -- through structured textual feedback rather than scalar rewards alone. By preserving detailed critiques instead of compressing them into binary preferences, Feedback Descent widens the information bottleneck in preference learning and enables directed optimization in text space rather than weight space. We show that in-context learning can turn structured feedback into gradient-like directional information that supports targeted edits. Unlike prior approaches that collapse each judgment into a single bit, our evaluators pair every comparison with textual feedback, which serves as high-bandwidth supervision. The iteration loop runs entirely at inference time, modifies no model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. On the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the $99.9$th percentile of a database with more than $260{,}000$ compounds across six protein targets.
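
The loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Comparison`, `feedback_descent`, `toy_compare`, and `toy_propose` are hypothetical, and the evaluator and proposer are deterministic toy stand-ins for what would presumably be language-model calls. The sketch only shows the shape of the idea: each pairwise comparison returns a preference bit together with a textual critique, the critiques accumulate, and the next candidate is proposed conditioned on them, with no weights updated.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Comparison:
    challenger_wins: bool  # the single preference bit
    feedback: str          # the textual critique kept alongside it


def feedback_descent(
    initial: str,
    propose: Callable[[str, List[str]], str],
    compare: Callable[[str, str], Comparison],
    steps: int,
) -> Tuple[str, List[str]]:
    """Inference-time loop: propose, compare against the incumbent, keep the critique."""
    best = initial
    history: List[str] = []
    for _ in range(steps):
        candidate = propose(best, history)
        result = compare(best, candidate)
        history.append(result.feedback)  # feedback is retained, not discarded
        if result.challenger_wins:
            best = candidate
    return best, history


# Toy stand-ins (assumptions for illustration only).
REQUIRED = ("concise", "cite sources", "step by step")


def toy_score(text: str) -> int:
    return sum(k in text for k in REQUIRED)


def toy_compare(incumbent: str, challenger: str) -> Comparison:
    missing = [k for k in REQUIRED if k not in challenger]
    feedback = "missing: " + "; ".join(missing) if missing else "nothing missing"
    return Comparison(toy_score(challenger) > toy_score(incumbent), feedback)


def toy_propose(best: str, history: List[str]) -> str:
    # Use the latest critique as a direction for a targeted edit.
    if history and history[-1].startswith("missing: "):
        for item in history[-1][len("missing: "):].split("; "):
            if item not in best:
                return best + " Requirement: " + item + "."
    return best


if __name__ == "__main__":
    final, critiques = feedback_descent(
        "Answer the question.", toy_propose, toy_compare, steps=4
    )
    print(final)
    print(critiques)
```

In this toy setting the critique names exactly what the candidate lacks, so every edit is targeted. With only the preference bit, the proposer would have no indication of which change to try next.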
