Designing de novo biological sequences with desired properties, such as protein and DNA sequences, often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations. These experiments can consist of multiple stages of increasing precision and evaluation cost, through which candidates are filtered. This makes the diversity of proposed candidates a key consideration in the ideation phase. In this work, we propose an active learning algorithm that leverages epistemic uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions, with the objective of obtaining a diverse batch of useful (as defined by some utility function, for example, the predicted antimicrobial activity of a peptide) and informative candidates after each round. We also propose a scheme to incorporate existing labeled datasets of candidates, in addition to a reward function, to speed up learning in GFlowNets. We present empirical results on several biological sequence design tasks and find that our method generates more diverse and novel batches of high-scoring candidates than existing approaches.
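The round structure described above can be illustrated with a toy sketch. It is not the method of this work: the oracle, the bootstrap k-nearest-neighbor ensemble, the upper-confidence-bound acquisition, the Hamming-distance diversity filter, and the random-mutation generator (a placeholder for the GFlowNet) are all assumptions chosen to keep the example self-contained.

```python
import random
import statistics

ALPHABET = "ACGT"
SEQ_LEN = 8

def oracle(seq):
    """Toy stand-in for an expensive wet-lab assay (assumption): GC fraction."""
    return sum(c in "GC" for c in seq) / len(seq)

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, seq, k=3):
    """Mean label of the k labeled sequences closest to seq."""
    nearest = sorted(train, key=lambda item: hamming(item[0], seq))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def acquisition(dataset, seq, rng, n_members=5, kappa=1.0):
    """Upper confidence bound from a bootstrap ensemble. The spread across
    ensemble members is a crude proxy for epistemic uncertainty."""
    preds = []
    for _ in range(n_members):
        boot = [rng.choice(dataset) for _ in dataset]
        preds.append(knn_predict(boot, seq))
    return statistics.mean(preds) + kappa * statistics.pstdev(preds)

def propose(dataset, rng, n_attempts=200):
    """Placeholder generator (NOT a GFlowNet): single-site random mutations
    of labeled sequences, keeping only sequences not yet labeled."""
    labeled = {s for s, _ in dataset}
    out = set()
    for _ in range(n_attempts):
        seq = list(rng.choice(dataset)[0])
        seq[rng.randrange(len(seq))] = rng.choice(ALPHABET)
        mutated = "".join(seq)
        if mutated not in labeled:
            out.add(mutated)
    return sorted(out)

def select_batch(candidates, scores, batch_size, min_dist=2):
    """Greedy selection by score, skipping candidates that are closer than
    min_dist (Hamming) to anything already in the batch."""
    ranked = sorted(candidates, key=lambda s: scores[s], reverse=True)
    batch = []
    for s in ranked:
        if all(hamming(s, b) >= min_dist for b in batch):
            batch.append(s)
        if len(batch) == batch_size:
            break
    return batch

def active_learning(rounds=3, batch_size=5, seed=0):
    """Run a few rounds of propose -> score -> select -> evaluate."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(10):  # small initial labeled set
        s = "".join(rng.choice(ALPHABET) for _ in range(SEQ_LEN))
        dataset.append((s, oracle(s)))
    for _ in range(rounds):
        candidates = propose(dataset, rng)
        scores = {s: acquisition(dataset, s, rng) for s in candidates}
        batch = select_batch(candidates, scores, batch_size)
        dataset.extend((s, oracle(s)) for s in batch)  # "wet-lab" step
    return dataset
```

Calling `active_learning()` returns the labeled set after three rounds. Raising `kappa` weights the uncertainty term more heavily, which favors informative candidates over ones that are merely predicted to score well.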