Existing deep learning models for reaction prediction in organic chemistry can reach high accuracy (> 90% for Natural Language Processing-based models). Because these models embed no chemical knowledge other than the information learnt from reaction data, the quality of the data sets plays a crucial role in their performance. Human curation is prohibitively expensive, so unaided approaches that remove chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We applied this method to the Pistachio collection of chemical reactions and to an open data set, both extracted from USPTO (United States Patent and Trademark Office) patents. Our results show improved prediction quality for models trained on the cleaned and balanced data sets. For the retrosynthetic models, the round-trip accuracy grows by 13 percentage points and the cumulative Jensen-Shannon divergence decreases by 30% relative to its original value. Coverage remains high at 97%, and class diversity is not affected by the cleaning. The proposed strategy is the first unassisted, rule-free technique to address automatic noise reduction in chemical data sets.
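For readers unfamiliar with the divergence reported above, the following minimal sketch computes the standard Jensen-Shannon divergence between two discrete distributions. It is an illustration only: the cumulative variant used in this work is not defined in the abstract and is not reproduced here, and the function names and the example class distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Standard (non-cumulative) Jensen-Shannon divergence in bits.

    p and q are discrete probability distributions over the same classes.
    The result lies in [0, 1] when base-2 logarithms are used.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical class distributions, for illustration only.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

print(jensen_shannon_divergence(uniform, uniform))  # identical distributions
print(jensen_shannon_divergence(uniform, skewed))   # small positive value
```

A value of zero means the two distributions coincide, so a lower divergence indicates a closer match between them. Note also that the 30% decrease quoted above is relative to the original value, whereas the 13-point gain in round-trip accuracy is an absolute difference in percentage points.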