Unifying Molecular and Textual Representations via Multi-task Language Modelling

Recent advances in neural language models have also been successfully applied to the field of chemistry, offering generative solutions for classical problems in molecular design and synthesis planning. These new methods have the potential to optimize laboratory operations and fuel a new era of data-driven automation in scientific discovery. However, specialized models are still typically required for each task, leading to the need for problem-specific fine-tuning and neglecting task interrelations. The main obstacle in this field is the lack of a unified representation between natural language and chemical representations, complicating and limiting human-machine interaction. Here, we propose a multi-domain, multi-task language model to solve a wide range of tasks in both the chemical and natural language domains. By leveraging multi-task learning, our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models. Interestingly, sharing weights across domains remarkably improves our model when benchmarked against state-of-the-art baselines on single-domain and cross-domain tasks. In particular, sharing information across domains and tasks gives rise to large improvements in cross-domain tasks, the magnitude of which increases with scale, as measured by more than a dozen relevant metrics. Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences by superseding problem-specific fine-tuning and enhancing human-model interactions.
