Element2Vec: Build Chemical Element Representation from Text for Property Prediction

Accurate property data for chemical elements is crucial for materials design and manufacturing, but many properties are difficult to measure directly due to equipment constraints. Traditional methods predict a property through numerical analysis of other elements' properties or of related properties, but they often fail to model complex relationships; after all, not all characteristics can be represented as scalars. Recent efforts have explored advanced AI tools such as language models for property estimation, but these still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vec, which represents chemical elements from natural-language text to support research in the natural sciences. Given text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Beyond the complicated relationships across elements, computational challenges arise from 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) extremely limited data: with only 118 known elements, data for specific properties is often highly sparse and incomplete. We therefore also design a test-time training method based on self-attention that clearly reduces the prediction error incurred by vanilla regression. We hope this work can pave the way for advancing AI-driven discovery in materials science.
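To make the sparse-data setting concrete, here is a minimal sketch of filling in a missing property value from element embeddings. It is not the paper's test-time training method. It is a plain similarity-weighted average (softmax over dot products), chosen only as an illustration. The function name `attention_impute`, the `temperature` parameter, the element labels `X1` to `X3`, and all numbers are hypothetical placeholders, not real Element2Vec vectors or measured properties.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_impute(embeddings, values, temperature=1.0):
    """Fill missing property values (None) with a softmax-weighted
    average of the known values, weighted by embedding similarity.

    Illustrative baseline only; an assumption, not the paper's method.
    """
    known = [name for name, v in values.items() if v is not None]
    if not known:
        raise ValueError("need at least one known property value")
    filled = dict(values)
    for name, v in values.items():
        if v is not None:
            continue  # keep measured values untouched
        scores = [dot(embeddings[name], embeddings[k]) / temperature
                  for k in known]
        weights = softmax(scores)
        filled[name] = sum(w * values[k] for w, k in zip(weights, known))
    return filled


# Toy placeholder data (hypothetical, not real element embeddings).
toy_embeddings = {
    "X1": [1.0, 0.0],
    "X2": [0.0, 1.0],
    "X3": [1.0, 0.0],  # same direction as X1
}
toy_values = {"X1": 10.0, "X2": 20.0, "X3": None}

filled = attention_impute(toy_embeddings, toy_values)
sharp = attention_impute(toy_embeddings, toy_values, temperature=0.01)
```

In this toy setup the estimate for `X3` is pulled toward the value of `X1`, its nearest neighbour in embedding space, and a lower `temperature` sharpens that preference. With so few labelled elements, a weighting scheme like this has very few free parameters, which is one reason similarity-based estimates are a common starting point before anything learned is attempted.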

