Exploring validation metrics for offline model-based optimisation

In offline model-based optimisation (MBO), we are interested in using machine learning to design candidates that maximise some measure of desirability, as judged by an expensive real-world scoring process. Offline MBO approximates this expensive scoring function and uses the approximation to evaluate generated designs; however, such evaluation is inexact because one approximation is being evaluated with another. Instead, we ask: if we did have the real-world scoring function at hand, which cheap-to-compute validation metrics would correlate best with it? Since the real-world scoring function is available for simulated MBO datasets, insights obtained there can be transferred to real-world offline MBO tasks, where the scoring function is expensive to compute. To address this question, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and apply it to conditional denoising diffusion models. Empirically, we find that two validation metrics, agreement and Fréchet distance, correlate quite well with the ground truth. When there is high variability in conditional generation, feedback is required in the form of an approximated version of the real-world scoring function. Furthermore, we find that generating high-scoring samples may require heavily weighting the generative model in favour of sample quality, potentially at the cost of sample diversity.
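As a rough illustration of the question posed above, the sketch below computes a Fréchet distance between two sets of feature vectors and then checks how strongly a validation metric tracks ground-truth scores across several models. It rests on assumptions the abstract does not confirm: the Fréchet distance is taken in its common Gaussian-fit form, the features are whatever embedding one chooses, Pearson correlation stands in for the unspecified correlation measure, and the names `frechet_distance` and `metric_score_correlation` are hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Assumption: rows are samples, columns are feature dimensions.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # tr((cov_a cov_b)^{1/2}) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def metric_score_correlation(metric_values, true_scores):
    """Pearson correlation between a validation metric and ground-truth scores.

    Assumption: one entry per trained model or hyperparameter setting.
    """
    return float(np.corrcoef(metric_values, true_scores)[0, 1])
```

For a lower-is-better metric such as Fréchet distance, a strongly negative correlation with the ground-truth score would indicate a useful validation signal; one may equally negate the metric before correlating.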
