Formants are the spectral maxima that result from acoustic resonances of the human vocal tract, and their accurate estimation is among the most fundamental speech processing problems. Recent work has shown that these frequencies can be accurately estimated using deep learning techniques. However, when presented with speech from a domain different from the one they were trained on, these methods exhibit a decline in performance, limiting their usage as generic tools. The contribution of this paper is a new network architecture that performs well across a variety of speaker and speech domains. Our proposed model is composed of a shared encoder that takes a spectrogram as input and outputs a domain-invariant representation. Multiple decoders then further process this representation, each responsible for predicting a different formant while taking the lower formant predictions into account. An advantage of our model is that it is based on heatmaps, which provide a probability distribution over formant predictions. Results suggest that our proposed model better represents the signal across various domains and leads to better formant frequency tracking and estimation.
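To make the described architecture concrete, below is a minimal PyTorch-style sketch of the encoder/decoder layout: a shared encoder maps the spectrogram to a domain-invariant representation, and one decoder per formant emits a heatmap (a per-frame probability distribution over frequency bins), with each decoder also receiving the heatmaps of the lower formants. All module names, layer choices, and hyperparameters here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """Maps a spectrogram to a shared, domain-invariant representation
    (hypothetical layer sizes; the abstract does not specify them)."""

    def __init__(self, n_freq_bins: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_freq_bins, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, spectrogram):
        # spectrogram: (batch, n_freq_bins, n_frames)
        return self.net(spectrogram)  # (batch, hidden, n_frames)


class FormantDecoder(nn.Module):
    """Predicts a heatmap over frequency bins for one formant,
    conditioned on the heatmaps of all lower formants."""

    def __init__(self, hidden: int, n_bins: int, n_lower: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(hidden + n_lower * n_bins, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, n_bins, kernel_size=1),
        )

    def forward(self, shared, lower_heatmaps):
        # Concatenate the shared representation with lower-formant heatmaps
        # along the channel dimension.
        x = torch.cat([shared] + lower_heatmaps, dim=1)
        # Softmax over frequency bins gives a per-frame probability distribution.
        return torch.softmax(self.net(x), dim=1)


class FormantTracker(nn.Module):
    """Shared encoder followed by one cascaded decoder per formant."""

    def __init__(self, n_freq_bins=257, n_bins=200, n_formants=4, hidden=256):
        super().__init__()
        self.encoder = SharedEncoder(n_freq_bins, hidden)
        self.decoders = nn.ModuleList(
            FormantDecoder(hidden, n_bins, n_lower=k) for k in range(n_formants)
        )

    def forward(self, spectrogram):
        shared = self.encoder(spectrogram)
        heatmaps = []
        for decoder in self.decoders:
            # Each decoder sees the heatmaps already produced for lower formants.
            heatmaps.append(decoder(shared, heatmaps))
        return heatmaps  # list of (batch, n_bins, n_frames) distributions
```

Under this sketch, a point estimate of each formant frequency per frame could be read off the heatmap as its argmax bin or as the expected frequency under the predicted distribution.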