Joint representations between images and text have been deeply investigated in the literature. In computer vision, the benefits of incorporating natural language have become clear for enabling semantic-level control of images. In this work, we present $\textbf{MolT5}$, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. $\textbf{MolT5}$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Furthermore, since $\textbf{MolT5}$ pretrains models on single-modal data, it helps overcome the data-scarcity problem in the chemistry domain. Additionally, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. By interfacing molecules with natural language, we enable a higher semantic level of control over molecule discovery and understanding, a critical capability for scientific domains such as drug discovery and material design. Our results show that $\textbf{MolT5}$-based models are able to generate outputs, both molecules and captions, that in many cases are high quality and consistent with the input. On molecule generation, our best model achieves 30% exact-match test accuracy (i.e., it generates the correct structure for about one-third of the captions in our held-out test set).
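To make the exact-match figure concrete, the sketch below shows how exact-match accuracy is commonly computed for text-based molecule generation: the generated and ground-truth molecule strings are canonicalized with RDKit, then compared as strings, so that two different SMILES spellings of the same structure still count as a match. This is a minimal illustration under that standard assumption; the function and variable names are hypothetical and not taken from the MolT5 codebase.

```python
# Illustrative sketch: exact-match accuracy via canonical SMILES comparison.
# Assumes RDKit is installed; helper names here are hypothetical.
from rdkit import Chem


def canonical_smiles(smiles: str) -> str | None:
    """Return the canonical SMILES for a string, or None if it fails to parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose canonical form equals the reference's."""
    matches = 0
    for pred, ref in zip(predictions, references):
        pred_can = canonical_smiles(pred)
        ref_can = canonical_smiles(ref)
        matches += int(pred_can is not None and pred_can == ref_can)
    return matches / len(predictions)


# Example: two captions, one correctly generated structure -> 50% exact match.
preds = ["CCO", "c1ccccc1O"]   # generated molecules
refs = ["OCC", "c1ccccc1N"]    # ground truth ("CCO" == "OCC" after canonicalization)
print(exact_match_accuracy(preds, refs))  # 0.5
```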