Acoustic-to-articulatory inversion (AAI) aims to estimate the parameters of the articulators from speech audio. Two challenges are common in AAI: limited data and unsatisfactory performance in the speaker-independent scenario. Most current works focus on extracting features directly from speech while ignoring phoneme information, which may limit the performance of AAI. To this end, we propose a novel network, SPN, that uses two different streams to carry out the AAI task. First, to improve speaker-independent performance, we propose a new phoneme stream network that estimates the articulatory parameters as phoneme features. To the best of our knowledge, this is the first work to extract speaker-independent features from phonemes to improve AAI performance. Second, to better represent the speech information, we train a speech stream network that combines local and global features. Compared with the state of the art (SOTA), the proposed method reduces RMSE by 0.18 mm and increases the Pearson correlation coefficient by 6.0% in the speaker-independent experiment. The code has been released at https://github.com/liujinyu123/AAINetwork-SPN.
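To make the two-stream idea concrete, below is a minimal PyTorch sketch of a speech stream (local convolutional features combined with global recurrent features), a phoneme stream (frame-level phoneme labels mapped to speaker-independent articulatory cues), and a fusion head that regresses articulator trajectories. The layer sizes, the BiLSTM/convolution choices, the concatenation-based fusion, and the 12-dimensional EMA output are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal two-stream AAI sketch (assumed layer sizes and fusion strategy, not the official SPN code).
import torch
import torch.nn as nn


class SpeechStream(nn.Module):
    """Combines local (convolutional) and global (recurrent) speech features."""
    def __init__(self, in_dim=80, hidden=256):
        super().__init__()
        # Local context via 1-D convolution over time.
        self.local = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Global context via a bidirectional LSTM.
        self.global_rnn = nn.LSTM(hidden, hidden // 2,
                                  batch_first=True, bidirectional=True)

    def forward(self, feats):                              # feats: (B, T, in_dim)
        local = self.local(feats.transpose(1, 2)).transpose(1, 2)
        global_, _ = self.global_rnn(local)
        return torch.cat([local, global_], dim=-1)         # (B, T, 2*hidden)


class PhonemeStream(nn.Module):
    """Maps frame-aligned phoneme labels to speaker-independent articulatory cues."""
    def __init__(self, n_phones=72, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phones, hidden)
        self.rnn = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)

    def forward(self, phones):                             # phones: (B, T) int64 labels
        x, _ = self.rnn(self.embed(phones))
        return x                                           # (B, T, hidden)


class SPN(nn.Module):
    """Fuses the two streams and regresses articulator trajectories (e.g. 12-dim EMA)."""
    def __init__(self, out_dim=12, hidden=256):
        super().__init__()
        self.speech = SpeechStream(hidden=hidden)
        self.phoneme = PhonemeStream(hidden=hidden)
        self.head = nn.Linear(2 * hidden + hidden, out_dim)

    def forward(self, feats, phones):
        fused = torch.cat([self.speech(feats), self.phoneme(phones)], dim=-1)
        return self.head(fused)                            # (B, T, out_dim)


if __name__ == "__main__":
    model = SPN()
    mel = torch.randn(2, 100, 80)                          # batch of 80-dim mel frames
    phones = torch.randint(0, 72, (2, 100))                # frame-aligned phoneme ids
    print(model(mel, phones).shape)                        # torch.Size([2, 100, 12])
```

In this sketch the two streams are trained jointly and fused by concatenation before a linear regression head; the paper's actual fusion strategy and training objectives may differ.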