Multi-resolution spectro-temporal features of a speech signal model how the brain perceives sound, with cortical cells tuned to different spectral and temporal modulations. These features yield a higher-dimensional representation of the speech signal. This paper evaluates how well this auditory-cortex representation of speech contributes to estimating the articulatory features of the corresponding signals. Because obtaining articulatory features from the acoustic features of speech has been a challenging problem of interest to several speech communities, we investigate whether this multi-resolution representation can serve as the acoustic features. We used the University of Wisconsin X-ray Microbeam (XRMB) database of clean speech to train a feed-forward deep neural network (DNN) to estimate the articulatory trajectories of six tract variables. The set of multi-resolution spectro-temporal features used to train the model was selected by tuning the scale and rate vector parameters to obtain the best-performing model. Experiments achieved a correlation of 0.675 with the ground-truth tract variables. We compared the performance of this speech inversion system with prior experiments conducted using Mel-frequency cepstral coefficients (MFCCs).
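The correlation reported above compares estimated and ground-truth tract-variable trajectories. As a minimal sketch of how such a score might be computed, the following assumes Pearson correlation evaluated per tract variable and then averaged; the abstract does not state the exact averaging scheme, so that choice, the function names, and the synthetic data are all illustrative assumptions rather than the authors' procedure.

```python
import numpy as np


def per_variable_correlation(predicted, reference):
    """Pearson correlation for each column of two (frames, variables) arrays."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same shape")
    # Center each trajectory over time (axis 0).
    p = predicted - predicted.mean(axis=0)
    r = reference - reference.mean(axis=0)
    numerator = (p * r).sum(axis=0)
    denominator = np.sqrt((p ** 2).sum(axis=0) * (r ** 2).sum(axis=0))
    return numerator / denominator


def mean_correlation(predicted, reference):
    """Average of the per-variable correlations (assumed aggregation)."""
    return float(per_variable_correlation(predicted, reference).mean())


if __name__ == "__main__":
    # Synthetic stand-in for six tract-variable trajectories; not real XRMB data.
    rng = np.random.default_rng(0)
    reference = rng.standard_normal((200, 6))
    predicted = reference + 0.5 * rng.standard_normal((200, 6))
    print(per_variable_correlation(predicted, reference))
    print(mean_correlation(predicted, reference))
```

Reporting the per-variable values alongside the mean makes it visible when one tract variable is estimated much worse than the others, which a single averaged number would hide.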