We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are \textit{better} (e.g., more stable) than existing sequences. By definition, the target sequences and their attribute values lie outside the training distribution, which poses a challenge to existing methods that aim to generate the target sequence directly. Instead, in this work, we propose Iterative Controlled Extrapolation (ICE), which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs that demonstrate small improvements in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are available at: https://github.com/vishakhpk/iter-extrapolation.
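The idea of reaching out-of-range attribute values through repeated local edits can be illustrated with a toy sketch. This is not the authors' implementation: the names `toy_score`, `propose_local_edits`, and `iterative_edit` are hypothetical, the attribute (the count of the letter `A`) is a stand-in for a real property such as stability, the random single-position substitutions stand in for a trained editing model, and the greedy accept-if-better rule is an assumption of this sketch rather than something the abstract specifies.

```python
import random
from typing import Callable, List, Tuple

ALPHABET = "ACDE"  # toy alphabet, assumed for illustration only


def toy_score(seq: str) -> int:
    """Hypothetical attribute: the number of 'A' characters in the sequence."""
    return seq.count("A")


def propose_local_edits(seq: str, rng: random.Random, n: int = 8) -> List[str]:
    """Stand-in for a learned editor: return n single-position substitutions."""
    candidates = []
    for _ in range(n):
        pos = rng.randrange(len(seq))
        new_char = rng.choice(ALPHABET)
        candidates.append(seq[:pos] + new_char + seq[pos + 1:])
    return candidates


def iterative_edit(
    seq: str,
    score: Callable[[str], int],
    steps: int,
    rng: random.Random,
) -> Tuple[str, List[str]]:
    """Apply local edits repeatedly, keeping an edit only if the score rises."""
    history = [seq]
    for _ in range(steps):
        best = max(propose_local_edits(seq, rng), key=score)
        if score(best) > score(seq):
            seq = best
            history.append(seq)
    return seq, history


if __name__ == "__main__":
    start = "CDECDECDEC"
    final, history = iterative_edit(start, toy_score, steps=50, rng=random.Random(0))
    print(start, toy_score(start))
    print(final, toy_score(final))
    print(len(history) - 1, "accepted edits")
```

Each accepted step changes a single position, so no individual edit moves far from the current sequence, yet the accumulated edits can carry the attribute value well past that of the starting sequence.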