Knowledge of the phase equilibria of mixtures is crucial in nature and in technical chemistry. Calculating phase equilibria of mixtures requires activity coefficients. However, experimental data on activity coefficients are often limited due to the high cost of experiments. To predict activity coefficients accurately and efficiently, machine learning approaches have recently been developed. Yet current machine learning approaches still extrapolate poorly to activity coefficients of unknown molecules. In this work, we introduce the SMILES-to-Properties-Transformer (SPT), a natural language processing network that predicts binary limiting activity coefficients from SMILES codes. To overcome the limited availability of experimental data, we first train our network on a large dataset of synthetic data sampled from COSMO-RS (10 million data points) and then fine-tune the model on experimental data (20 870 data points). This training strategy enables SPT to accurately predict limiting activity coefficients even for unknown molecules, cutting the mean prediction error in half compared to state-of-the-art models for activity coefficient prediction such as COSMO-RS and UNIFAC, and improving on recent machine learning approaches.
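As a rough illustration of how a solute/solvent pair of SMILES codes might be turned into a single token sequence for a language-model-style network, consider the minimal sketch below. The regular expression, the special tokens, the padding length, and all function names are assumptions made for illustration; they are not taken from the SPT implementation, whose actual tokenization and vocabulary may differ.

```python
import re

# Assumed tokenization rule: bracket atoms first, then the two-letter
# halogens, then any single remaining character (atoms, bonds, ring digits).
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")

# Hypothetical special tokens; the real SPT vocabulary may look different.
SPECIAL = ["<pad>", "<sos>", "<sep>", "<eos>"]


def tokenize(smiles):
    """Split one SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map special tokens and every token seen in smiles_list to an integer id."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(SPECIAL + tokens)}


def encode_pair(solute, solvent, vocab, max_len=32):
    """Encode a binary mixture as <sos> solute <sep> solvent <eos>, padded to max_len."""
    seq = ["<sos>"] + tokenize(solute) + ["<sep>"] + tokenize(solvent) + ["<eos>"]
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    seq += ["<pad>"] * (max_len - len(seq))
    return [vocab[tok] for tok in seq]


# Toy vocabulary built from three example molecules (ethanol, water, dichloromethane).
vocab = build_vocab(["CCO", "O", "ClCCl"])
# Ethanol as solute in water as solvent.
ids = encode_pair("CCO", "O", vocab, max_len=10)
print(ids)
```

In a two-stage setup like the one described above, the same encoding would be applied both to the synthetic pretraining data and to the experimental fine-tuning data, so that the network sees a consistent input format in both stages.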