Spiking Transformers have recently emerged as promising architectures for combining the efficiency of spiking neural networks with the representational power of self-attention. However, the lack of standardized implementations, evaluation pipelines, and consistent design choices has hindered fair comparison and principled analysis. In this paper, we introduce STEP, a unified benchmark framework for Spiking Transformers that supports a wide range of tasks, including classification, segmentation, and detection, across static, event-based, and sequential datasets. STEP provides modular support for diverse components such as spiking neurons, input encodings, surrogate gradients, and multiple backends (e.g., SpikingJelly, BrainCog). Using STEP, we reproduce and evaluate several representative models, and conduct systematic ablation studies on attention design, neuron types, encoding schemes, and temporal modeling capabilities. We also propose a unified analytical model for energy estimation that accounts for spike sparsity, bitwidth, and memory access, and show that quantized ANNs may offer comparable or better energy efficiency. Our results suggest that current Spiking Transformers rely heavily on convolutional frontends and lack strong temporal modeling, underscoring the need for spike-native architectural innovations. The full code is available at: https://github.com/Fancyssc/STEP