Accurate property data for chemical elements is crucial for materials design and manufacturing, but many properties are difficult to measure directly due to equipment constraints. Traditional methods predict a property from the properties of other elements or from related properties via numerical analysis, but they often fail to model complex relationships; after all, not all characteristics can be represented as scalars. Recent efforts have explored advanced AI tools such as language models for property estimation, but these still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vec to effectively represent chemical elements from natural language text in support of research in the natural sciences. Given text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Beyond the complicated relationships across elements, computational challenges arise from 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) extremely limited data: with only 118 known elements, data for specific properties is often highly sparse and incomplete. We therefore also design a test-time training method based on self-attention to substantially mitigate the prediction error incurred by vanilla regression. We hope this work paves the way for advancing AI-driven discovery in materials science.