Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that first decode phonemes and then assemble sentences with an n-gram language model (LM), which prevents joint optimization of all stages. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder sets a new state of the art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT achieves a word error rate (WER) of 10.22%, down from the 24.69% of the prior end-to-end method. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
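For readers unfamiliar with the metric, the WER figures quoted above can be read through the standard definition: the word-level edit distance (substitutions, insertions, deletions) between decoded and reference sentences, divided by the number of reference words. The following is a minimal sketch of that definition, not the benchmark's evaluation code; the function names `word_error_rate` and `relative_reduction` are hypothetical, and whitespace tokenization without text normalization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no normalization;
    a real evaluation pipeline may lowercase or strip punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old: float, new: float) -> float:
    """Fractional improvement when an error rate drops from old to new."""
    return (old - new) / old


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Relative WER reduction implied by the reported 24.69% -> 10.22%.
    print(f"{relative_reduction(24.69, 10.22):.1%}")
```

Under this definition, the reported drop from 24.69% to 10.22% corresponds to a relative WER reduction of about 58.6%. Benchmark scores are usually aggregated over a whole test set (total edits divided by total reference words) rather than averaged per sentence, so the per-sentence function here is illustrative only.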