Most organisms, including humans, function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. In this work, an autoencoder architecture (MirrorNet) inspired by this sensorimotor learning paradigm is explored to learn how to control an articulatory synthesizer. The synthesizer takes as input control signals consisting of six vocal Tract Variables (TVs) and source features (voicing indicators and pitch), and generates the corresponding auditory spectrograms. Because of the non-linear structure of the synthesizer, the control parameters that produce a target speech signal are neither readily computable nor always unique. Here we demonstrate how to initialize the MirrorNet learning so as to produce a meaningful range of articulatory values. Once trained, the MirrorNet successfully estimates the TVs and source features needed to synthesize any arbitrary speech utterance. This approach outperforms the best previously designed `speech inversion' systems on the Wisconsin X-ray microbeam (XRMB) dataset.
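To make the architecture concrete, the following is a minimal sketch of the autoencoder structure described above: an encoder that inverts an auditory spectrogram into control trajectories (six TVs plus two source features), and a differentiable forward model standing in for the articulatory synthesizer so that an unsupervised reconstruction loss can be backpropagated. All layer sizes, names, and the learned-synthesizer stand-in are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a MirrorNet-style autoencoder (PyTorch).
# Layer widths and the spectrogram dimensionality are assumptions.
import torch
import torch.nn as nn

N_TVS = 6            # six vocal Tract Variables
N_SOURCE = 2         # source features: voicing indicator + pitch
N_CTRL = N_TVS + N_SOURCE
N_FREQ = 128         # auditory-spectrogram frequency channels (assumed)

class Encoder(nn.Module):
    """Maps an auditory spectrogram (B, N_FREQ, T) to control
    trajectories (B, N_CTRL, T): the `speech inversion' path."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_FREQ, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, N_CTRL, kernel_size=5, padding=2),
        )

    def forward(self, spec):
        return self.net(spec)

class ForwardModel(nn.Module):
    """Differentiable stand-in for the non-linear articulatory
    synthesizer: maps control trajectories back to a spectrogram
    so the reconstruction error can drive both networks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(N_CTRL, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, N_FREQ, kernel_size=5, padding=2),
        )

    def forward(self, ctrl):
        return self.net(ctrl)

encoder, fwd = Encoder(), ForwardModel()
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(fwd.parameters()), lr=1e-4)

spec = torch.randn(8, N_FREQ, 200)   # batch of target auditory spectrograms
ctrl = encoder(spec)                 # estimated TVs + source features
recon = fwd(ctrl)                    # resynthesized spectrogram
loss = nn.functional.mse_loss(recon, spec)  # unsupervised reconstruction loss
loss.backward()
opt.step()
```

In this reading, the initialization step mentioned in the abstract would amount to starting the encoder in a regime where its outputs fall within a physically meaningful range of articulatory values before joint training begins; the details of that procedure are not specified here.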