In this article, we investigate whispered-to-natural speech conversion using a sequence-to-sequence generation approach and propose a modified transformer architecture. We investigate different kinds of features, such as mel-frequency cepstral coefficients (MFCCs) and smoothed spectral features. The network is trained end-to-end (E2E) in a supervised manner. We investigate the effectiveness of an embedded auxiliary decoder that is placed after N encoder sub-layers and trained with a frame-level objective function for identifying source phoneme labels. At test time, we predict target audio features and generate audio from them. We evaluate on the standard wTIMIT and CHAINS datasets. We report results as word error rate (WER), computed using an automatic speech recognition (ASR) system, and as BLEU scores. In addition, we measure the spectral shape of the output speech signal by comparing its formant distributions with those of the reference speech signal at the frame level. In this respect, we find that the formant probability distribution of the whispered-to-natural converted speech is closer to the ground-truth distribution. To the best of the authors' knowledge, this is the first time a transformer with an auxiliary decoder has been applied to whispered-to-natural speech conversion. [This PDF is TASLP submission draft version 1.0, 14th April 2020.]
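For readers unfamiliar with the WER metric mentioned in the abstract, the following is a minimal sketch of the standard word-level edit-distance definition (substitutions, deletions, and insertions divided by the number of reference words). It is an illustration only, not the authors' scoring pipeline: the function name `word_error_rate` is hypothetical, and whitespace tokenization without text normalization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In an evaluation like the one described, the hypothesis would be the ASR transcript of the converted speech and the reference would be the prompt text; a lower value indicates more intelligible output.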