Transformers operate as horizontal token-by-token scanners; at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes, rather than arithmetic computation, dominate inference throughput. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. This design reduces decode-time KV-cache traffic. Experimental results show that PHOTON achieves a better throughput-quality trade-off than competitive Transformer-based language models, yielding up to $10^{3}\times$ higher throughput per unit memory and significant advantages in long-context and multi-query tasks.
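To make the contrast in context-access patterns concrete, the following is a minimal, self-contained Python/NumPy sketch of flat token-level KV reads versus reads over a compressed hierarchy of contextual states. The toy "encoder" (chunked mean pooling plus a random projection), the single-head attention read, and all names (`chunk_size`, `decode_step`, etc.) are illustrative assumptions for this sketch, not the paper's actual architecture or training setup.

```python
# Toy illustration: how compressing tokens into low-rate contextual states shrinks
# the memory a decoder must read at each step. Assumptions: random embeddings,
# mean-pooled chunks as the "bottom-up encoder", one attention read as the "decoder".
import numpy as np

rng = np.random.default_rng(0)
d_model, chunk_size, n_tokens = 64, 16, 4096

# Token-level states a flat Transformer would keep in its KV cache.
token_states = rng.standard_normal((n_tokens, d_model))

# Bottom-up encoder: compress each chunk of tokens into one low-rate contextual state.
W_enc = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
chunks = token_states.reshape(n_tokens // chunk_size, chunk_size, d_model)
context_states = chunks.mean(axis=1) @ W_enc  # shape: (n_tokens // chunk_size, d_model)

def decode_step(query: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """One attention read over `memory`; cache traffic scales with len(memory)."""
    scores = memory @ query / np.sqrt(d_model)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ memory

query = rng.standard_normal(d_model)
flat_out = decode_step(query, token_states)      # reads 4096 cached states
photon_out = decode_step(query, context_states)  # reads 256 cached states

print("flat KV reads:   ", token_states.shape[0])
print("compressed reads:", context_states.shape[0],
      f"({token_states.shape[0] // context_states.shape[0]}x fewer)")
```

In this sketch the per-step memory traffic drops by the chunking factor (16x here); stacking several such compression levels, as the hierarchical design describes, compounds the reduction.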