Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research on the topic has so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it was designed to simulate the processing of short segments rather than long-form audio streams, and it does not provide an easy way to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to the unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches but also re-translation methods, enabling their comparison within the same framework in terms of both quality and latency. It also offers an interactive web interface to demo any system built within the tool.