Joint representations between images and text have been deeply investigated in the literature. In computer vision, the benefits of incorporating natural language have become clear for enabling semantic-level control of images. In this work, we present $\textbf{MolT5}$, a self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. $\textbf{MolT5}$ allows for new, useful, and challenging analogs of traditional vision-language tasks, such as molecule captioning and text-based de novo molecule generation (altogether: translation between molecules and language), which we explore for the first time. Furthermore, since $\textbf{MolT5}$ pretrains models on single-modal data, it helps overcome the data-scarcity problem in the chemistry domain. Additionally, we consider several metrics, including a new cross-modal embedding-based metric, to evaluate the tasks of molecule captioning and text-based molecule generation. By interfacing molecules with natural language, we enable a higher semantic level of control over molecule discovery and understanding, a critical capability for scientific domains such as drug discovery and material design. Our results show that $\textbf{MolT5}$-based models are able to generate outputs, both molecules and captions, that in many cases are high quality and consistent with the input. On molecule generation, our best model achieves 30% exact-match test accuracy (i.e., it generates the correct structure for about one-third of the captions in our held-out test set).
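To make the exact-match figure concrete, the sketch below shows how exact-match accuracy is commonly computed for text-based molecule generation: the generated and ground-truth molecule strings are canonicalized with RDKit, then compared as strings, so that two different SMILES spellings of the same structure still count as a match. This is a minimal illustration under that standard assumption; the function and variable names are hypothetical and not taken from the MolT5 codebase.

```python
# Illustrative sketch: exact-match accuracy via canonical SMILES comparison.
# Assumes RDKit is installed; helper names here are hypothetical.
from rdkit import Chem


def canonical_smiles(smiles: str) -> str | None:
    """Return the canonical SMILES for a string, or None if it fails to parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions whose canonical form equals the reference's."""
    matches = 0
    for pred, ref in zip(predictions, references):
        pred_can = canonical_smiles(pred)
        ref_can = canonical_smiles(ref)
        matches += int(pred_can is not None and pred_can == ref_can)
    return matches / len(predictions)


# Example: two captions, one correctly generated structure -> 50% exact match.
preds = ["CCO", "c1ccccc1O"]   # generated molecules
refs = ["OCC", "c1ccccc1N"]    # ground truth ("CCO" == "OCC" after canonicalization)
print(exact_match_accuracy(preds, refs))  # 0.5
```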