With machine learning a popular topic in the current computational materials science literature, creating representations for compounds has become commonplace. These representations are rarely compared, as evaluating their performance (and the performance of the algorithms they are used with) is non-trivial. Because many materials datasets contain bias and skew introduced by the research process, leave-one-cluster-out cross-validation (LOCO-CV) has been introduced as a way of measuring the performance of an algorithm in predicting previously unseen groups of materials. This raises the question of the impact, and control, of the range of cluster sizes on the LOCO-CV measurement outcomes. We present a thorough comparison between composition-based representations and investigate how kernel approximation functions can be used to better separate data to enhance LOCO-CV applications. We find that domain knowledge does not improve machine learning performance in most tasks tested, with band gap prediction being the notable exception. We also find that the radial basis function improves the linear separability of chemical datasets in all 10 datasets tested, and we provide a framework for applying this function in the LOCO-CV process to improve the outcome of LOCO-CV measurements regardless of machine learning algorithm, choice of metric, and choice of compound representation. We recommend kernelised LOCO-CV as a training paradigm for those looking to measure the extrapolatory power of an algorithm on materials data.
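The kernelised LOCO-CV idea described above can be illustrated with a minimal sketch: map composition features through an approximate RBF feature map, cluster in that transformed space, then hold out one cluster at a time. This is an illustrative assumption of the workflow, not the paper's exact implementation; the feature dimensions, gamma, and cluster count below are arbitrary, and the RBF approximation uses random Fourier features with plain Lloyd's k-means.

```python
import numpy as np

def rbf_features(X, gamma=0.5, n_components=64, seed=0):
    """Approximate an RBF kernel feature map with random Fourier features."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], n_components))
    b = rng.uniform(0, 2 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

def kmeans(X, k=5, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return labels

def loco_splits(labels):
    """Yield (train_idx, test_idx) pairs, holding out one cluster at a time."""
    for c in np.unique(labels):
        yield np.where(labels != c)[0], np.where(labels == c)[0]

# Toy stand-in for a composition-based representation: 60 "materials",
# 8 features each. A real use would plug in featurised compositions.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))

labels = kmeans(rbf_features(X), k=5)
splits = list(loco_splits(labels))
```

Each `(train, test)` pair partitions the dataset, so any model and metric can be evaluated across the splits; clustering after the RBF map is what distinguishes kernelised LOCO-CV from clustering on the raw representation.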