Searching for chromate replacements using natural language processing and machine learning algorithms

The past few years have seen machine learning applied to the exploration of new materials. As in many fields of research, the vast majority of knowledge is published as text, which poses challenges for either a consolidated or a statistical analysis across studies and reports. Such challenges include the inability to extract quantitative information and the difficulty of accessing the breadth of non-numerical information. To address this issue, the application of natural language processing (NLP) has been explored in several studies to date. In NLP, the assignment of high-dimensional vectors, known as embeddings, to passages of text preserves the syntactic and semantic relationships between words. Embeddings rely on machine learning algorithms, and in the present work we have employed the Word2Vec model, previously explored by others, and the BERT model, applying them to a unique challenge in materials engineering: the search for chromate replacements in the field of corrosion protection. From a database of over 80 million records, a down-selection of 5990 papers focused on the topic of corrosion protection was examined using NLP. This study demonstrates that it is possible to extract knowledge from the automated interpretation of the scientific literature and achieve expert human-level insights.
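
As a rough illustration of how embeddings can support a search for related terms, the sketch below ranks words by cosine similarity to a query word. It is not the pipeline used in this work: the three-dimensional vectors are invented for illustration (trained Word2Vec or BERT embeddings typically have hundreds of dimensions), and the names `toy_embeddings`, `candidate_a`, `candidate_b`, `unrelated_term`, and `rank_by_similarity` are hypothetical placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only. In practice these would
# come from a model trained on the corrosion-protection corpus.
toy_embeddings = {
    "chromate": [0.9, 0.1, 0.3],
    "candidate_a": [0.8, 0.2, 0.4],
    "candidate_b": [0.1, 0.9, 0.2],
    "unrelated_term": [-0.5, 0.1, -0.7],
}


def rank_by_similarity(query, embeddings):
    """Return (word, similarity) pairs for every other word, most similar first."""
    query_vec = embeddings[query]
    scores = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in embeddings.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("chromate", toy_embeddings)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```

Words whose vectors point in nearly the same direction as the query score close to 1, while unrelated words score near or below 0. That ordering is the kind of signal a similarity search over trained embeddings relies on.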
