The ever-increasing number of materials science articles makes it hard to infer chemistry-structure-property relations from published literature. We used natural language processing (NLP) methods to automatically extract material property data from the abstracts of polymer literature. As a component of our pipeline, we trained MaterialsBERT, a language model, on 2.4 million materials science abstracts; when used as the text encoder, it outperforms other baseline models on three out of five named entity recognition datasets. Using this pipeline, we obtained ~300,000 material property records from ~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse range of applications, such as fuel cells, supercapacitors, and polymer solar cells, to recover non-trivial insights. The data extracted through our pipeline is available through a web platform at https://polymerscholar.org, which can be used to conveniently locate material property data recorded in abstracts. This work demonstrates the feasibility of an automatic pipeline that starts from published literature and ends with a complete set of extracted material property information.