The growth of materials data and data-driven informatics has greatly accelerated the discovery and design of materials. Although data-driven models have advanced significantly, the quality of data resources remains less studied despite its large impact on model performance. In this work, we focus on data bias arising from the uneven coverage of materials families in existing knowledge. Observing that diversity differs among crystal systems in common materials databases, we propose an information entropy-based metric for measuring this bias. To mitigate the bias, we develop an entropy-targeted active learning (ET-AL) framework, which guides the acquisition of new data to improve the diversity of underrepresented crystal systems. We demonstrate the capability of ET-AL for bias mitigation and the resulting improvement in downstream machine learning models. This approach is broadly applicable to data-driven materials discovery, including autonomous data acquisition and dataset trimming to reduce bias, as well as to data-driven informatics in other scientific domains.
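To make the idea of an entropy-based diversity measure concrete, the following is a minimal sketch and not the metric defined in this work. It assumes that diversity within a crystal system can be approximated by the Shannon entropy of a binned scalar property, and that the system with the lowest entropy is the one to target for new data. The function names, the bin width, and the toy records are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values, bin_width=0.5):
    """Shannon entropy (in nats) of values binned into fixed-width bins.

    Assumption: a simple histogram stands in for the property distribution.
    """
    counts = Counter(math.floor(v / bin_width) for v in values)
    total = sum(counts.values())
    # Sum p * log(1/p) over occupied bins.
    return sum((c / total) * math.log(total / c) for c in counts.values())


def entropy_by_group(records, bin_width=0.5):
    """Map each group label (here, a crystal system) to its entropy."""
    groups = defaultdict(list)
    for system, value in records:
        groups[system].append(value)
    return {s: shannon_entropy(v, bin_width) for s, v in groups.items()}


def least_diverse(entropies):
    """Return the group with the lowest entropy, i.e. the acquisition target."""
    return min(entropies, key=entropies.get)


# Toy data (hypothetical): (crystal system, property value) pairs.
records = [
    ("cubic", 0.1), ("cubic", 0.7), ("cubic", 1.3), ("cubic", 1.9),
    ("triclinic", 0.1), ("triclinic", 0.2), ("triclinic", 0.3), ("triclinic", 0.4),
]

entropies = entropy_by_group(records)
target = least_diverse(entropies)
print(entropies, target)
```

In this toy example both systems have the same number of entries, but the triclinic values all fall into a single bin, so that system has lower entropy and is selected. This illustrates why a diversity measure can flag bias that a simple count of entries would miss.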