Data plays an increasingly important role in the exploration of cutting-edge materials, and a growing number of datasets have been generated, either by hand or through automated approaches. However, the materials science field struggles to make effective use of this abundance of data, especially in applied disciplines where materials are evaluated on device performance rather than on their properties. This article presents a new NLP task, structured information inference (SII), to address the complexities of device-level information extraction in materials science. We accomplished this task by fine-tuning GPT-3 on an existing perovskite solar cell FAIR (Findable, Accessible, Interoperable, Reusable) dataset, achieving an F1-score of 91.8%, and we updated the dataset with all related scientific papers published to date. The extracted data is formatted and normalized, so it can be used directly as input for subsequent data analysis. This feature will enable materials scientists to develop their own models by selecting high-quality review papers within their domain. Furthermore, we designed experiments to predict the electrical performance of solar cells and to design materials or devices with target parameters using large language models (LLMs). Without any feature selection, we obtained performance comparable to that of traditional machine learning methods, demonstrating the potential of LLMs to learn scientific knowledge and design new materials like a materials scientist.