Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) tokenizing a molecule into a canonical 1D token sequence that is invariant to both SE(3) transformations and permutations of the atom indices, and (2) designing an architecture that can model hybrid atom-based tokens, each coupling a discrete atom type with continuous 3D coordinates. To address these challenges, we introduce InertialAR. It devises a canonical tokenization that aligns each molecule to its inertial frame and reorders its atoms, ensuring SE(3) and permutation invariance. It also equips the attention mechanism with geometric awareness via geometric rotary positional encoding (GeoRoPE). Finally, it uses a hierarchical autoregressive paradigm to predict the next atom-based token: the atom type is predicted first, followed by its 3D coordinates, which are modeled with a diffusion loss. Experimentally, InertialAR achieves state-of-the-art performance on 7 of the 10 evaluation metrics for unconditional molecule generation across QM9, GEOM-Drugs, and B3LYP. In controllable generation for targeted chemical functionality, it significantly outperforms strong baselines, attaining state-of-the-art results on all 5 metrics.
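The inertial-frame tokenization described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper name `canonical_tokens`, the `ATOMIC_MASS` table, the rule used to fix the sign of each principal axis (positive mass-weighted third moment), and the atom ordering (sort by x, then y, then z in the aligned frame) are all assumptions chosen only to make the idea concrete. The sketch also assumes a generic molecule with distinct principal moments and nonzero third moments; symmetric molecules would need extra tie-breaking that the abstract does not specify.

```python
import numpy as np

# Standard atomic masses (u) for the few elements used in this sketch.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def canonical_tokens(types, coords):
    """Return (types, coords) in a canonical frame and atom order.

    Hypothetical helper: the sign and ordering conventions below are
    assumptions, not taken from the paper.
    """
    types = np.asarray(types)
    coords = np.asarray(coords, dtype=float)
    masses = np.array([ATOMIC_MASS[t] for t in types])

    # 1. Remove translation: move the center of mass to the origin.
    center = (masses[:, None] * coords).sum(axis=0) / masses.sum()
    x = coords - center

    # 2. Inertia tensor I = sum_i m_i (|r_i|^2 E - r_i r_i^T).
    r2 = (x ** 2).sum(axis=1)
    inertia = (masses * r2).sum() * np.eye(3) - (masses[:, None] * x).T @ x

    # 3. Principal axes: eigenvectors as columns, eigenvalues ascending.
    _, axes = np.linalg.eigh(inertia)

    # 4. Assumed sign convention: flip each of the first two axes so the
    #    mass-weighted third moment along it is positive, then build the
    #    third axis by a cross product so the frame stays right-handed.
    for k in range(2):
        skew = (masses * (x @ axes[:, k]) ** 3).sum()
        if skew < 0:
            axes[:, k] = -axes[:, k]
    axes[:, 2] = np.cross(axes[:, 0], axes[:, 1])

    # Express every atom in the inertial frame (removes rotation).
    aligned = x @ axes

    # 5. Assumed ordering: sort by x, then y, then z in the aligned frame
    #    (np.lexsort treats the LAST key as the primary one).
    order = np.lexsort((aligned[:, 2], aligned[:, 1], aligned[:, 0]))
    return types[order], aligned[order]


# Illustrative input with arbitrary coordinates (not a real geometry).
demo_types = ["C", "O", "N", "H"]
demo_coords = [[0.0, 0.0, 0.0], [1.2, 0.1, 0.0],
               [-0.6, 1.1, 0.3], [-0.5, -0.9, 0.8]]
for atom_type, xyz in zip(*canonical_tokens(demo_types, demo_coords)):
    print(atom_type, np.round(xyz, 3))
```

Each output row pairs a discrete atom type with continuous coordinates, which is the hybrid token the abstract refers to. Under the stated assumptions, rotating, translating, or re-indexing the input molecule should leave the resulting sequence unchanged up to floating-point error.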