Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor, particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or in a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
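To make the prompt/completion setup concrete, the sketch below shows one way a single training pair for the dopant-host linking task might be formatted for GPT-3 fine-tuning. It is a minimal illustration only: the example passage, the JSON field names (`host`, `dopants`, `application`), the `"###"` separator, and the `" END"` stop sequence are assumptions chosen for clarity, not the paper's verbatim schema.

```python
import json

# Hypothetical training pair: a passage (prompt) mapped to a structured
# completion, expressed as a list of JSON objects as described in the abstract.
passage = (
    "The band gap of ZnO was tuned by doping with 2 at.% Al, "
    "yielding transparent conducting films."
)

# Assumed schema linking a dopant to its host material and application.
record = [{"host": "ZnO", "dopants": ["Al"],
           "application": "transparent conducting films"}]

example = {
    # "###" is an assumed prompt/completion separator token.
    "prompt": passage + "\n\n###\n\n",
    # Leading space and " END" stop sequence are illustrative conventions.
    "completion": " " + json.dumps(record) + " END",
}

# Roughly 500 such pairs, written one JSON object per line (JSONL),
# would constitute the fine-tuning set supplied to the model.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

At inference time, an unseen passage formatted with the same separator would be sent to the fine-tuned model, and the completion parsed back from JSON into a database record.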