While the Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, current latent reasoning models remain limited by the token-by-token generation paradigm, which suffers from compounding errors and heavy context dependency. To address these limitations, we propose JEPA-Reasoner, a novel JEPA-based architecture enhanced with generative ability for latent reasoning. We augment this architecture with a separate action-talker model, Talker, which reconstructs human-readable text from the latent representations produced by JEPA-Reasoner. Our work demonstrates that decoupling latent-space reasoning from token production enables JEPA-Reasoner to produce mixed latent vectors, which lays a foundation for multi-threaded reasoning and yields superior robustness against compounding errors in autoregressive generation.