Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention-sink phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, where sinks remain anchored to fixed positions, the sink positions in DLMs shift throughout the generation process, exhibiting dynamic behavior. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
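The abstract does not specify how sink positions are located. A common heuristic in the attention-sink literature is to flag key positions that receive a disproportionate share of attention mass, averaged over heads and query positions. The sketch below illustrates that heuristic only; the function name `find_attention_sinks` and the threshold value are hypothetical, not taken from the paper.

```python
import torch

def find_attention_sinks(attn: torch.Tensor, threshold: float = 0.3) -> torch.Tensor:
    """Flag candidate attention-sink positions in one attention map.

    attn: tensor of shape (num_heads, seq_len, seq_len); rows are queries,
          columns are keys, and each row sums to 1 after softmax.
    Returns a boolean tensor over key positions whose mean incoming
    attention exceeds `threshold` (an assumed, tunable cutoff).
    """
    # Average over heads and query positions: the attention mass
    # each key position receives on average.
    incoming = attn.mean(dim=(0, 1))  # shape: (seq_len,)
    return incoming > threshold

# Toy usage with a random attention map, normalized over the key axis.
heads, seq = 4, 16
attn = torch.softmax(torch.randn(heads, seq, seq), dim=-1)
print(find_attention_sinks(attn).nonzero(as_tuple=True)[0])
```

In practice, reported sinks tend to absorb far more attention than ordinary positions, so ranking positions by incoming mass is usually more informative than the exact threshold; masking the flagged columns and re-measuring task performance would then reproduce the kind of ablation the abstract describes.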