The explosion of scientific publications overloads researchers with information. The problem is even more acute in interdisciplinary studies, where several fields must be explored. One tool that can help researchers overcome it is Natural Language Processing (NLP), a machine-learning (ML) technique that allows scientists to automatically synthesize information from many articles. As a practical example, we have used NLP to conduct an interdisciplinary search for compounds that could be carriers of the Diffuse Interstellar Bands (DIBs), whose origin is a long-standing open question in astrophysics. We have trained an NLP model on a corpus of 1.5 million open-access, cross-domain articles and fine-tuned it on a corpus of astrophysical publications about DIBs. Our analysis points toward several molecules, studied primarily in biology, that have transitions at the wavelengths of several DIBs and are composed of atoms abundant in the interstellar medium. Several of these molecules contain chromophores, small molecular groups responsible for a molecule's colour, that could be promising candidate carriers. Identifying viable carriers demonstrates the value of using NLP to tackle open scientific questions in an interdisciplinary manner.