Biological machine learning is often bottlenecked by a lack of scaled data. One promising route to relieving data bottlenecks is through high-throughput screens, which can experimentally test the activity of $10^6-10^{12}$ protein sequences in parallel. In this article, we introduce algorithms to optimize high-throughput screens for data creation and model training. We focus on the large-scale regime, where dataset sizes are limited by the cost of measurement and sequencing. We show that when active sequences are rare, we maximize information gain if we collect only positive examples of active sequences, i.e., $x$ with $y > 0$. We can correct for the missing negative examples using a generative model of the library, producing a consistent and efficient estimate of the true $p(y \mid x)$. We demonstrate this approach in simulation and on a large-scale screen of antibodies. Overall, co-designing experiments and inference dramatically accelerates learning.
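The correction described above can be sketched with a toy example. The idea is that if we observe only positives drawn from $p(x \mid y = 1)$, and we have a generative model of the library $p(x)$ plus the prevalence $p(y=1)$, Bayes' rule recovers the posterior: $p(y=1 \mid x) = p(x \mid y=1)\,p(y=1)/p(x)$. The sketch below uses a hypothetical 1-D Gaussian "library" and a logistic activity rule; the specific distributions and the `posterior` helper are illustrative assumptions, not the paper's actual models (in practice $p(x)$ and $p(x \mid y=1)$ would be sequence models).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
n = 200_000

# Toy library: x ~ N(0, 1); true (unknown) activity rule p(y=1|x) = sigmoid(2x - 3),
# so active sequences are relatively rare.
x = rng.normal(0.0, 1.0, n)
p_active_true = 1.0 / (1.0 + np.exp(-(2.0 * x - 3.0)))
y = rng.random(n) < p_active_true

# In a positive-only screen we keep just the actives; we also assume the
# prevalence p(y=1) is known or estimated from sequencing counts.
positives = x[y]
pi = y.mean()

# Fit a generative model to the positives (here a single Gaussian stands in
# for a learned sequence model).
mu_pos, sigma_pos = positives.mean(), positives.std()

def posterior(x_query):
    """Bayes-rule estimate: p(y=1|x) = p(x|y=1) * p(y=1) / p(x)."""
    return normal_pdf(x_query, mu_pos, sigma_pos) * pi / normal_pdf(x_query, 0.0, 1.0)

for xq in (0.0, 1.0, 2.0):
    truth = 1.0 / (1.0 + np.exp(-(2.0 * xq - 3.0)))
    print(f"x={xq:.1f}  estimated p(y=1|x)={posterior(xq):.3f}  true={truth:.3f}")
```

Even though the negatives are never observed, the estimate tracks the true $p(y=1 \mid x)$ closely in this toy setting, because the library model $p(x)$ supplies exactly the information the discarded negatives would have carried.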