Uncovering hidden graph structures underlying real-world data is a critical challenge with broad applications across scientific domains. Recently, transformer-based models leveraging the attention mechanism have demonstrated strong empirical success in capturing complex dependencies within graphs. However, the theoretical understanding of their training dynamics has been limited to tree-like graphs, where each node depends on a single parent. Extending provable guarantees to more general directed acyclic graphs (DAGs), in which a node may have multiple parents, remains challenging, primarily due to the difficulty of designing training objectives that enable different attention heads to separately learn distinct parent relationships. In this work, we address this problem by introducing a novel information-theoretic metric, the kernel-guided mutual information (KG-MI), based on the $f$-divergence. Our objective combines KG-MI with a multi-head attention framework in which each head is associated with a distinct marginal transition kernel, allowing diverse parent-child dependencies to be modeled effectively. We prove that, given sequences generated by a $K$-parent DAG, training a single-layer, multi-head transformer via gradient ascent converges to the global optimum in polynomial time. Furthermore, we characterize the attention-score patterns at convergence. When the $f$-divergence is specialized to the KL divergence, the learned attention scores accurately reflect the ground-truth adjacency matrix, thereby provably recovering the underlying graph structure. Experimental results validate our theoretical findings.
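For context, a minimal sketch of the standard information-theoretic quantities involved, in our own illustrative notation (the precise definition of KG-MI appears in the paper body): for a convex function $f$ with $f(1) = 0$, the $f$-divergence and the mutual information it induces are
\[
D_f(P \,\|\, Q) \;=\; \mathbb{E}_{Q}\!\left[ f\!\left( \frac{\mathrm{d}P}{\mathrm{d}Q} \right) \right],
\qquad
I_f(X; Y) \;=\; D_f\!\left( P_{XY} \,\|\, P_X \otimes P_Y \right).
\]
Taking $f(t) = t \log t$ recovers the KL divergence, under which $I_f$ is the standard Shannon mutual information; this is the specialization for which the learned attention scores recover the adjacency matrix. Plausibly, a KG-MI objective would sum one such term per head, each guided by that head's marginal transition kernel, e.g. $\sum_{h=1}^{K} I_f^{\pi_h}$, where the kernel symbol $\pi_h$ and the superscript notation are hypothetical and not taken from the paper.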