The automatic extraction of materials and their properties from the scientific literature is gaining attention in data-driven materials science (Materials Informatics). In this paper, we discuss Grobid-superconductors, our solution for automatically extracting superconductor material names and their respective properties from text. Built as a Grobid module, it combines machine learning and heuristic approaches in a multi-step architecture that accepts either raw text or PDF documents as input. Using Grobid-superconductors, we built SuperCon2, a database of 40,324 material and property records from 37,700 papers. The material (or sample) information is represented by name, chemical formula, and material class, and is further characterized by shape, doping, substitution variables for components, and substrate. The properties include the superconducting critical temperature (Tc) and, when available, the applied pressure and the Tc measurement method.
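To make the record layout described above concrete, the following is a minimal sketch of how one such record could be modelled. The class name `SuperconductorRecord`, the field names, and the use of plain strings for values are all assumptions made for illustration; they are not the actual SuperCon2 schema or the Grobid-superconductors API.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class SuperconductorRecord:
    """Hypothetical layout of one material/property record (not the real schema)."""

    # Material (or sample) description
    name: Optional[str] = None
    formula: Optional[str] = None
    material_class: Optional[str] = None

    # Additional characterization of the sample
    shape: Optional[str] = None
    doping: Optional[str] = None
    # Maps a variable used in the formula (e.g. "x") to the values it takes
    substitution_variables: Dict[str, List[str]] = field(default_factory=dict)
    substrate: Optional[str] = None

    # Properties; pressure and method are only filled in when the paper reports them
    critical_temperature: Optional[str] = None
    applied_pressure: Optional[str] = None
    measurement_method: Optional[str] = None


# Illustrative record written by hand, not taken from SuperCon2
record = SuperconductorRecord(
    name="magnesium diboride",
    formula="MgB2",
    critical_temperature="39 K",
)
row = asdict(record)  # flat dict, convenient for writing to CSV or JSON
```

Optional fields default to `None` so that a record can be stored even when the source paper reports only a material and its Tc, which mirrors the "when available" wording used for pressure and measurement method.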