Data-driven design promises to accelerate materials discovery, but searching the vast design space of chemistry, structure, and synthesis methods is prohibitively costly. Bayesian Optimization (BO) employs uncertainty-aware machine learning models to select promising designs to evaluate, thereby reducing this cost. However, BO with mixed numerical and categorical variables, which is of particular interest in materials design, has not been well studied. In this work, we survey frequentist and Bayesian approaches to uncertainty quantification for machine learning with mixed variables. We then conduct a systematic comparative study of their performance in BO using a popular representative model from each group: the random forest-based Lolo model (frequentist) and the latent variable Gaussian process model (Bayesian). We examine the efficacy of the two models in optimizing mathematical functions, as well as properties of structural and functional materials, and observe performance differences that depend on problem dimensionality and complexity. By investigating the machine learning models' predictive and uncertainty estimation capabilities, we provide interpretations of the observed performance differences. Our results provide practical guidance on choosing between frequentist and Bayesian uncertainty-aware machine learning models for mixed-variable BO in materials design.
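To make the selection step concrete, the following is a minimal sketch of how an uncertainty-aware model's predictions might be turned into a choice of the next design to evaluate. It assumes a maximization problem, a Gaussian predictive distribution, and the expected improvement acquisition function; the abstract does not state which acquisition function is used. The names `expected_improvement`, `select_next`, and `toy_predict` are hypothetical, and `toy_predict` is a stand-in for a trained model such as a random forest or a Gaussian process, not either of the models studied here.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement (maximization) under a Gaussian predictive distribution."""
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


def select_next(candidates, predict, best_so_far):
    """Return the candidate with the largest expected improvement.

    `predict` maps a design to a (mean, std) pair.
    """
    return max(
        candidates,
        key=lambda design: expected_improvement(*predict(design), best_so_far),
    )


def toy_predict(design):
    """Hypothetical surrogate over one categorical and one numerical variable."""
    category, x = design
    offset = {"A": 0.0, "B": 0.5}[category]  # assumed effect of the category
    mean = offset - (x - 0.5) ** 2
    std = 0.1 + 0.4 * abs(x - 0.5)  # assumed: more uncertain away from x = 0.5
    return mean, std


# Mixed-variable candidate designs: (categorical level, numerical value).
candidates = [(c, x / 10) for c in ("A", "B") for x in range(11)]
chosen = select_next(candidates, toy_predict, best_so_far=0.4)
print(chosen)
```

In a full BO loop, the chosen design would be evaluated, added to the training data, and the model refit before the next selection. The quality of the `(mean, std)` estimates is what distinguishes the frequentist and Bayesian models compared in this work.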