Growing volumes of materials data and data-centric informatics tools have greatly accelerated the discovery and design of materials. While data-driven models, such as machine learning models, have drawn much attention and seen significant progress, the quality of data resources is equally important but less studied. In this work, we focus on bias mitigation, an important aspect of materials data quality. By quantifying the diversity of stability within different crystal systems, we propose a metric for measuring structure-stability bias in materials data. To mitigate this bias, we develop an entropy-target active learning (ET-AL) framework that guides the acquisition of new data so that the diversity of underrepresented crystal systems improves. With experiments on materials datasets, we demonstrate the capability of ET-AL and the improvement in machine learning models that bias mitigation brings. The approach is applicable to data-centric informatics in other scientific domains.
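As a rough illustration of the idea of measuring stability diversity per crystal system, the sketch below computes a histogram-based Shannon entropy of a stability value (for example, formation energy) within each crystal system and picks the least diverse system as a candidate target for new data. The abstract does not give the exact metric or acquisition rule, so the histogram estimator, the bin settings, the function names `stability_entropy`, `entropy_by_system`, and `least_diverse_system`, and the toy values are all assumptions made for illustration. This is not the authors' ET-AL implementation.

```python
import numpy as np


def stability_entropy(values, bins=5, value_range=(-4.0, 1.0)):
    """Shannon entropy (in nats) of a histogram of stability values.

    Assumption: a fixed-range histogram stands in for whatever
    diversity estimator the paper actually uses.
    """
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log(p)).sum())


def entropy_by_system(systems, values, **kwargs):
    """Map each crystal system label to the entropy of its stability values."""
    systems = np.asarray(systems)
    values = np.asarray(values, dtype=float)
    return {
        str(s): stability_entropy(values[systems == s], **kwargs)
        for s in np.unique(systems)
    }


def least_diverse_system(entropies):
    """Return the crystal system with the lowest entropy (hypothetical target rule)."""
    return min(entropies, key=entropies.get)


# Toy data (made up): cubic entries span a wide stability range,
# triclinic entries cluster in a narrow one.
systems = ["cubic"] * 4 + ["triclinic"] * 4
energies = [-3.5, -2.5, -1.5, -0.5, -0.5, -0.4, -0.6, -0.5]

ents = entropy_by_system(systems, energies)
target = least_diverse_system(ents)
print(ents, target)
```

In this toy setting, the gap between the highest and lowest per-system entropy could serve as a simple bias indicator, and an acquisition loop would favor candidates expected to raise the entropy of the target system. Both choices are assumptions here, not details taken from the abstract.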