Predicting the chemical properties of compounds is crucial for discovering novel materials and drugs with desired characteristics. Recent advances in machine learning have enabled automatic predictive modeling from past experimental data reported in the literature. However, these datasets are often biased for various reasons, such as experimental plans and publication decisions, and prediction models trained on such biased datasets often overfit to the biased distributions and perform poorly in subsequent use. Hence, this study focused on mitigating bias in experimental datasets. We adopted two techniques, one from causal inference and one from domain adaptation, and combined them with graph neural networks that can represent molecular structures. Experimental results in four possible bias scenarios indicated that the inverse propensity scoring-based method yielded solid improvements, whereas the domain-invariant representation learning approach failed to do so.
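To illustrate the general idea behind inverse propensity scoring, the following is a minimal sketch, not the method as implemented in this study. It assumes that a propensity (the estimated probability that a compound appears in the biased training set) is already available for each sample. The function names `ips_weights` and `ips_weighted_mse`, the clipping threshold, and the self-normalized form of the loss are all illustrative assumptions.

```python
def ips_weights(propensities, clip=0.05):
    """Return inverse propensity weights.

    Assumption: propensities are clipped from below at `clip` so that
    rarely observed samples do not receive unboundedly large weights.
    """
    return [1.0 / max(p, clip) for p in propensities]


def ips_weighted_mse(y_true, y_pred, propensities, clip=0.05):
    """Self-normalized, IPS-weighted mean squared error.

    Samples that were unlikely to be observed (low propensity) count
    more, which counteracts their under-representation in the data.
    """
    weights = ips_weights(propensities, clip)
    weighted_sq_err = sum(
        w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred)
    )
    return weighted_sq_err / sum(weights)


# Toy example with two compounds (hypothetical values).
y_true = [1.0, 2.0]
y_pred = [1.0, 4.0]
propensities = [0.5, 0.25]  # the second compound is under-represented
loss = ips_weighted_mse(y_true, y_pred, propensities)
```

In this toy example the error on the under-represented compound is weighted twice as heavily as it would be for the first compound. In a graph neural network setting, the same per-sample weights would multiply the per-molecule loss terms during training.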