End-to-end architectures have made promising progress in speech translation (ST). However, the ST task remains challenging under low-resource conditions. Most ST models show unsatisfactory results, especially when word-level information from the source speech utterance is unavailable. In this study, we survey methods for improving ST performance without using source transcription, and propose a learning framework that utilizes a language-independent universal phone recognizer. The framework is based on an attention-based sequence-to-sequence model, in which the encoder generates phonetic embeddings and phone-aware acoustic representations, and the decoder controls the fusion of the two embedding streams to produce the target token sequence. In addition to investigating different fusion strategies, we explore a specific use of byte pair encoding (BPE), which compresses a phone sequence into a syllable-like segmented sequence. Because of this symbol conversion, a segmented sequence carries not only pronunciation but also language-dependent information that individual phones lack. Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the Conformer-based baseline, and its performance is close to that of the existing best method that uses source transcription.
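To make the BPE step concrete, the following is a minimal, self-contained sketch of how byte pair encoding can compress a phone sequence into syllable-like segments: the most frequent adjacent phone pair is repeatedly merged into a single unit. This is a toy illustration with hypothetical function names and a made-up Spanish-like phone corpus, not the authors' actual pipeline or tokenizer.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules over a corpus of phone sequences.

    corpus: list of phone sequences (each a list of phone symbols).
    Returns the list of merge rules, in the order they were learned.
    """
    seqs = [list(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a merged symbol.
        seqs = [_merge_pair(seq, best) for seq in seqs]
    return merges

def _merge_pair(seq, pair):
    """Replace adjacent occurrences of `pair` in `seq` with one joined symbol."""
    merged = pair[0] + "+" + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def apply_bpe(seq, merges):
    """Segment a new phone sequence by replaying the learned merges in order."""
    seq = list(seq)
    for pair in merges:
        seq = _merge_pair(seq, pair)
    return seq

# Example: the frequent pair ("l", "a") is merged into a syllable-like unit.
corpus = [["o", "l", "a"], ["b", "l", "a", "b", "l", "a"], ["l", "a", "g", "o"]]
merges = learn_bpe(corpus, 1)
print(merges)                              # [('l', 'a')]
print(apply_bpe(["o", "l", "a"], merges))  # ['o', 'l+a']
```

In the paper's framework, such merged units are language-dependent (they reflect which phone combinations actually recur in the target domain), which is the information plain phones do not carry.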