Articulatory representation learning is fundamental to modeling the neural speech production system. Our previous work established a deep paradigm that decomposes articulatory kinematics data into gestures, which explicitly model the phonological and linguistic structure encoded in the human speech production mechanism, and the corresponding gestural scores. We continue this line of work by addressing two concerns: (1) the articulators are entangled in the original algorithm, so that some articulators do not exhibit effective movement patterns, which limits the interpretability of both gestures and gestural scores; (2) electromagnetic articulography (EMA) data is sparsely sampled from the articulators, which limits the intelligibility of the learned representations. In this work, we propose a novel articulatory representation decomposition algorithm that takes advantage of guided factor analysis to derive articulatory-specific factors and factor scores. A neural convolutive matrix factorization algorithm is then applied to the factor scores to derive the new gestures and gestural scores. We experiment with the real-time MRI (rtMRI) corpus, which captures fine-grained vocal tract contours. Both subjective and objective evaluation results suggest that the proposed system delivers articulatory representations that are intelligible, generalizable, efficient, and interpretable.
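To make the roles of gestures and gestural scores concrete, the following is a minimal sketch of the reconstruction step in classical convolutive matrix factorization, where the data matrix is approximated as a sum of time-shifted products of gesture slices and scores. It is not the neural variant proposed here, and it omits the guided factor analysis stage and any nonnegativity or sparsity constraints. The function names, array shapes, and toy values are assumptions chosen for illustration only.

```python
import numpy as np

def shift_right(H, t):
    """Shift the columns of H right by t frames, zero-padding on the left."""
    if t == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out

def cmf_reconstruct(W, H):
    """Convolutive reconstruction: X_hat = sum_t W[t] @ shift_right(H, t).

    W: (T, D, K) array, K gestures, each spanning T frames over D channels.
    H: (K, N) array, gestural scores (activation of each gesture per frame).
    Returns an array of shape (D, N).
    """
    T = W.shape[0]
    return sum(W[t] @ shift_right(H, t) for t in range(T))

# Toy example (hypothetical sizes): activate gesture 0 once, at frame 2.
rng = np.random.default_rng(0)
T, D, K, N = 3, 4, 2, 10
W = rng.random((T, D, K))
H = np.zeros((K, N))
H[0, 2] = 1.0
X_hat = cmf_reconstruct(W, H)
```

In this toy case a single activation in the gestural score places one copy of gesture 0 into frames 2 through 4 of the reconstruction, which is why a sparse score matrix can be read as a timeline of when each gesture occurs.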