Most research on data-driven speech representation learning has focused on raw audio in an end-to-end manner, paying little attention to the internal phonological or gestural structure of speech. This work investigates speech representations derived from articulatory kinematics signals, using a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores. By applying sparsity constraints, the gestural scores exploit the discrete combinatorial properties of phonological gestures. Phoneme recognition experiments further show that the gestural scores successfully encode phonological information. The proposed work thus builds a bridge between articulatory phonology and deep neural networks to obtain informative, intelligible, interpretable, and efficient speech representations.
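
As a rough illustration of the decomposition described above, the following NumPy sketch writes out the standard convolutive factorization model, in which the data matrix is approximated by a sum of time-shifted products of gesture templates and gestural scores. It is a sketch under stated assumptions, not the neural implementation used in this work: the function names (`shift_right`, `reconstruct`, `objective`), the array layout, the penalty weight `lam`, and the choice of an L1 penalty as the sparsity constraint are all illustrative.

```python
import numpy as np


def shift_right(H, t):
    """Shift the columns of H right by t frames, zero-padding on the left."""
    if t == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out


def reconstruct(W, H):
    """Convolutive model: X_hat = sum_t W[t] @ shift_right(H, t).

    W: (T, channels, K) array; W[:, :, k] is the k-th gesture, a short
       spatio-temporal pattern spanning T frames (assumed layout).
    H: (K, frames) array of gestural scores (activations over time).
    """
    return sum(W[t] @ shift_right(H, t) for t in range(W.shape[0]))


def objective(X, W, H, lam=0.1):
    """Squared reconstruction error plus an L1 sparsity penalty on H.

    The L1 term is one common way to encourage sparse activations;
    it is an assumption here, not necessarily the paper's constraint.
    """
    residual = X - reconstruct(W, H)
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.abs(H))
```

With a single nonzero entry in `H`, `reconstruct` places one copy of the corresponding gesture in the output starting at that frame, which is the sense in which sparse gestural scores act as discrete, combinable activations of a small gesture inventory.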