Decoding inner speech with an end-to-end brain-to-text neural interface

Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), which prevents joint optimization of all stages. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
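
For readers unfamiliar with the headline metric, WER is conventionally the word-level edit distance (substitutions, insertions, and deletions) between a decoded sentence and its reference, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the benchmarks' official scoring may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Illustrative only: tokenization is plain whitespace splitting,
    which is an assumption, not the benchmark's scoring script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, the reported drop from 24.69% to 10.22% corresponds to a relative reduction of about 58.6%, that is, (24.69 - 10.22) / 24.69.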
