Tackling the most pressing problems for humanity, such as the climate crisis and the threat of global pandemics, requires accelerating the pace of scientific discovery. While science has traditionally relied to a large extent on trial and error, and even serendipity, the last few decades have seen a surge of data-driven scientific discoveries. However, to truly leverage large-scale data sets and high-throughput experimental setups, machine learning methods will need to be further improved and better integrated into the scientific discovery pipeline. A key challenge for current machine learning methods in this context is the efficient exploration of very large search spaces, which requires techniques for estimating reducible (epistemic) uncertainty and for generating sets of diverse and informative experiments to perform. This motivated a new probabilistic machine learning framework called GFlowNets, which can be applied in the modeling, hypothesis generation, and experimental design stages of the experimental science loop. GFlowNets learn to sample from a distribution given indirectly by a reward function corresponding to an unnormalized probability, which enables sampling diverse, high-reward candidates. GFlowNets can also be used to form efficient, amortized Bayesian posterior estimators for causal models conditioned on the experimental data already acquired. Such posterior models can then provide estimators of epistemic uncertainty and information gain that can drive an experimental design policy. Altogether, we argue that GFlowNets can become a valuable tool for AI-driven scientific discovery, especially in scenarios with very large candidate spaces where we have access to cheap but inaccurate measurements or to expensive but accurate measurements. This is a common setting in drug and material discovery, which we use as examples throughout the paper.
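
To make the sampling target concrete, the following is a minimal Python sketch on a toy candidate space small enough to enumerate. It does not train a GFlowNet; it only computes the distribution p(x) = R(x)/Z, with Z the sum of all rewards, that a reward-proportional sampler is meant to approximate, and draws from it directly. The candidate names, the reward values, and the helper functions `target_distribution` and `sample_candidates` are all assumptions made up for illustration.

```python
import random

# Hypothetical toy candidate space with made-up, non-negative rewards.
# In a realistic setting the space is far too large to enumerate, which is
# why a learned sampler would be needed instead of this direct computation.
REWARDS = {"A": 1.0, "B": 3.0, "C": 6.0}


def target_distribution(rewards):
    """Normalize rewards into p(x) = R(x) / Z, where Z is the sum of rewards."""
    z = sum(rewards.values())
    return {x: r / z for x, r in rewards.items()}


def sample_candidates(rewards, n, seed=0):
    """Draw n candidates with probability proportional to their reward."""
    rng = random.Random(seed)
    names = list(rewards)
    weights = [rewards[x] for x in names]
    return rng.choices(names, weights=weights, k=n)


probs = target_distribution(REWARDS)
batch = sample_candidates(REWARDS, n=10)
```

Because candidates are drawn in proportion to reward rather than by always picking the maximum, lower-reward candidates such as "A" keep a non-zero probability of being proposed, which is where the diversity of the sampled batch comes from.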