We propose (a) a Language Agnostic end-to-end Speech Translation model (LAST) and (b) a data augmentation strategy to improve code-switching (CS) performance. With growing globalization, speakers increasingly switch between multiple languages within fluent speech. Such CS complicates traditional speech recognition and translation, because the system must first identify which language was spoken, then apply a language-dependent recognizer, and finally run a translation component to produce output in the desired target language. Such a pipeline introduces latency and errors. In this paper, we eliminate the need for such a pipeline by treating speech recognition and translation as one unified end-to-end speech translation problem. By training LAST on both input languages, we decode speech into a single target language, regardless of the input language. LAST delivers comparable recognition and speech translation accuracy in monolingual usage, while considerably reducing latency and error rate when CS occurs.