We introduce Group SELFIES, a molecular string representation that uses group tokens to represent functional groups or entire substructures while maintaining chemical robustness guarantees. Molecular string representations such as SMILES and SELFIES serve as the basis for molecular generation and optimization in chemical language models, deep generative models, and evolutionary methods. Whereas SMILES and SELFIES operate at the level of individual atoms, Group SELFIES builds on the chemical robustness guarantees of SELFIES and adds group tokens, making the representation more flexible. Moreover, group tokens can exploit the inductive biases of molecular fragments that capture meaningful chemical motifs. Our experiments demonstrate these advantages: Group SELFIES improves distribution learning on common molecular datasets, and random sampling of Group SELFIES strings yields higher-quality molecules than random sampling of regular SELFIES strings. Our open-source implementation of Group SELFIES is available online, and we hope it will aid future research in molecular generation and optimization.
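As a rough illustration of what a group token buys, the sketch below expands a short token sequence into a SMILES string, with one token standing for a whole fragment. It is a toy under stated assumptions, not the Group SELFIES grammar or the API of its implementation: the token names, the two vocabularies, and the `expand` helper are all hypothetical, and fragments are simply joined in a linear chain.

```python
# Toy illustration only: NOT the Group SELFIES grammar or its library API.
# All token names below are hypothetical and chosen for readability.

# Atom-level tokens: one token per atom.
ATOM_TOKENS = {"[C]": "C", "[N]": "N", "[O]": "O"}

# Group tokens: one token per functional group or substructure.
GROUP_TOKENS = {
    "[:phenyl]": "c1ccccc1",
    "[:carboxyl]": "C(=O)O",
}


def expand(tokens):
    """Join the SMILES fragment of each token into one linear chain."""
    vocab = {**ATOM_TOKENS, **GROUP_TOKENS}
    return "".join(vocab[token] for token in tokens)


def heavy_atom_count(smiles):
    """Count atoms, assuming single-letter element symbols only."""
    return sum(ch.isalpha() for ch in smiles)


# Benzoic acid written with two group tokens.
tokens = ["[:phenyl]", "[:carboxyl]"]
smiles = expand(tokens)
print(smiles, len(tokens), heavy_atom_count(smiles))
```

Here two tokens describe a molecule with nine heavy atoms, so a generative model that samples one token at a time places a complete, chemically sensible motif in a single step. The robustness guarantees mentioned in the abstract come from the SELFIES derivation rules, which this toy does not model.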