One-dimensional NMR spectroscopy is one of the most widely used techniques for the characterization of organic compounds and natural products. For molecules with up to 36 non-hydrogen atoms, the number of possible structures has been estimated to range from $10^{20}$ to $10^{60}$. Determining the structure (formula and connectivity) of a molecule of this size using only its one-dimensional $^1$H and/or $^{13}$C NMR spectrum, i.e. de novo structure generation, thus appears completely intractable. Here we show that this task can be achieved for systems with up to 40 non-hydrogen atoms across the full elemental coverage typically encountered in organic chemistry (C, N, O, H, P, S, Si, B, and the halogens) using a deep learning framework, covering a vast portion of the drug-like chemical space. Leveraging insights from natural language processing, we show that our transformer-based architecture places the correct molecule among its first 15 predictions in 55.2% of cases using only the $^1$H and $^{13}$C NMR spectra, overcoming the combinatorial growth of the chemical space while remaining extensible to experimental data via fine-tuning.
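The headline figure, the fraction of cases in which the correct molecule appears within the first 15 predictions, is a top-k accuracy with k = 15. A minimal sketch of how such a metric could be computed is given below. It is an illustration, not the evaluation code used in this work: the function name `top_k_accuracy`, the use of SMILES strings, and the assumption that all strings were canonicalized beforehand (so that exact string comparison is meaningful) are ours.

```python
from typing import Sequence


def top_k_accuracy(
    targets: Sequence[str],
    ranked_predictions: Sequence[Sequence[str]],
    k: int = 15,
) -> float:
    """Fraction of targets found among the first k ranked predictions.

    Assumption (not from the paper): targets and predictions are molecule
    strings (e.g. SMILES) already canonicalized by the same tool, so that
    identical molecules compare equal as strings.
    """
    if len(targets) != len(ranked_predictions):
        raise ValueError("targets and ranked_predictions must have equal length")
    if not targets:
        return 0.0
    # A target counts as a hit if it occurs anywhere in its top-k candidates.
    hits = sum(
        1
        for target, candidates in zip(targets, ranked_predictions)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Hypothetical toy data: three target molecules, each with a ranked candidate list.
toy_targets = ["CCO", "c1ccccc1", "CC(=O)O"]
toy_predictions = [
    ["CCO", "CO"],             # correct at rank 1
    ["C1CCCCC1", "c1ccccc1"],  # correct at rank 2
    ["CCC"],                   # never correct
]
```

With this toy data, `top_k_accuracy(toy_targets, toy_predictions, k=1)` counts only the first molecule as a hit, while `k=2` also counts the second. Comparing raw strings is the simplest choice; a real evaluation would need a consistent canonicalization step before comparison.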