This article presents a new NLP task, structured information inference (SII), to address the complexities of device-level information extraction in materials science. We accomplished this task by fine-tuning GPT-3 on an existing perovskite solar cell FAIR (Findable, Accessible, Interoperable, Reusable) dataset, achieving an F1 score of 91.8, and we updated the dataset with all related scientific papers published to date. The produced dataset is formatted and normalized, so it can be used directly as input for subsequent data analysis. This feature will enable materials scientists to develop their own models by selecting high-quality review papers within their domain. Furthermore, we designed experiments that use LLMs to predict the electrical performance of solar cells and to reverse-predict parameters on both material gene and FAIR datasets. Without feature selection, we obtained performance comparable to traditional machine learning methods, which demonstrates the potential of large language models to judge materials and design new materials like a materials scientist.
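As a rough illustration of how an F1 score for structured extraction can be computed, the sketch below scores predicted device records against reference records at the level of (field, value) pairs. The abstract does not say how the reported 91.8 was calculated, so the micro-averaging, the string normalization, and the field names (`perovskite`, `etl`, `htl`) are all assumptions made for this example rather than the authors' evaluation protocol.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(value.lower().split())


def field_f1(predicted: List[Record], gold: List[Record]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over (field, value) pairs.

    Assumption: records are aligned one-to-one (same order, same length).
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_pairs = {(k, normalize(v)) for k, v in pred.items()}
        ref_pairs = {(k, normalize(v)) for k, v in ref.items()}
        tp += len(pred_pairs & ref_pairs)   # pairs extracted correctly
        fp += len(pred_pairs - ref_pairs)   # pairs extracted but wrong
        fn += len(ref_pairs - pred_pairs)   # reference pairs that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical device records; the field names are illustrative, not the dataset's schema.
gold = [{"perovskite": "MAPbI3", "etl": "TiO2", "htl": "Spiro-OMeTAD"}]
predicted = [{"perovskite": "mapbi3", "etl": "SnO2", "htl": "Spiro-OMeTAD"}]

precision, recall, f1 = field_f1(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Scoring at the field level means a partially correct record still earns credit for the fields it got right, which suits device descriptions with many independent attributes. A stricter record-level exact match would be an equally defensible choice.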