Dataset Bias in the Natural Sciences: A Case Study in Chemical Reaction Prediction and Synthesis Design

Datasets in the Natural Sciences are often curated with the goal of aiding scientific understanding, and hence may not always be in a form that facilitates the application of machine learning. In this paper, we identify three trends within the fields of chemical reaction prediction and synthesis design that require a change in direction. First, the manner in which reaction datasets are split into reactants and reagents encourages testing models in an unrealistically generous manner. Second, we highlight the prevalence of mislabelled data, and suggest that the focus should be on outlier removal rather than on data fitting alone. Lastly, we discuss the problem of reagent prediction, in addition to reactant prediction, as a requirement for solving the full synthesis design problem, highlighting the mismatch between what machine learning solves and what a lab chemist would need. Our critiques are also relevant to the burgeoning field of using machine learning to accelerate progress in the experimental Natural Sciences, where datasets are often split in a biased way and are highly noisy, and where contextual variables that are not evident from the data strongly influence the outcome of experiments.
