Embedding acoustic information into fixed-length representations is of interest for a whole range of applications in speech and audio technology. Two novel unsupervised approaches to generating acoustic embeddings by modelling acoustic context are proposed. The first approach is a contextual joint factor synthesis encoder, where the encoder in an encoder/decoder framework is trained to extract joint factors from surrounding audio frames that best generate the target output. The second approach is a contextual joint factor analysis encoder, where the encoder is trained to analyse joint factors from the source signal that correlate best with the neighbouring audio. To evaluate the effectiveness of our approaches against prior work, two tasks are conducted -- phone classification and speaker recognition -- tested on different TIMIT data sets. Experimental results show that one of the proposed approaches outperforms phone classification baselines, yielding a classification accuracy of 74.1%. When additional out-of-domain data are used for training, a further 3% improvement can be obtained, for both the phone classification and speaker recognition tasks.
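To make the two training objectives concrete, the following is a minimal sketch in PyTorch. It is not the authors' implementation: the class names, frame dimension, context width, layer sizes, and mean-squared-error reconstruction loss are all illustrative assumptions. The synthesis-style encoder embeds the surrounding context frames and decodes the target frame; the analysis-style encoder embeds the target frame and decodes the surrounding context.

```python
# Hypothetical sketch of the two contextual objectives; all sizes are assumptions.
import torch
import torch.nn as nn

FRAME_DIM = 40   # e.g. 40-d filterbank features per frame (assumption)
CONTEXT = 4      # frames of context on each side of the target (assumption)
EMBED_DIM = 128  # fixed-length embedding size (assumption)

class Encoder(nn.Module):
    """Maps `in_dim` input features to a fixed-length embedding."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, EMBED_DIM))

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Reconstructs `out_dim` output features from the embedding."""
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, 256), nn.ReLU(),
            nn.Linear(256, out_dim))

    def forward(self, z):
        return self.net(z)

# Synthesis-style: embed the context window, generate the target frame.
enc_s = Encoder(in_dim=2 * CONTEXT * FRAME_DIM)
dec_s = Decoder(out_dim=FRAME_DIM)

# Analysis-style: embed the target frame, predict the context window.
enc_a = Encoder(in_dim=FRAME_DIM)
dec_a = Decoder(out_dim=2 * CONTEXT * FRAME_DIM)

loss_fn = nn.MSELoss()
context = torch.randn(8, 2 * CONTEXT * FRAME_DIM)  # batch of flattened context windows
target = torch.randn(8, FRAME_DIM)                 # batch of centre frames

loss_synthesis = loss_fn(dec_s(enc_s(context)), target)  # context -> target
loss_analysis = loss_fn(dec_a(enc_a(target)), context)   # target -> context
```

Under this reading, the embedding used for downstream tasks such as phone classification or speaker recognition would be the encoder output, with the decoder discarded after unsupervised training.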