Inducing semantic representations directly from speech signals is highly challenging, but it has many useful applications in speech mining and spoken language understanding. This study tackles the unsupervised learning of semantic representations for spoken utterances. First, by converting speech signals into sequences of hidden units produced by acoustic unit discovery, we propose WavEmbed, a multimodal sequential autoencoder that predicts hidden units from a dense representation of speech. Second, we propose S-HuBERT, which induces meaning through knowledge distillation: a sentence embedding model is first trained on the hidden units and then transfers its knowledge to a speech encoder through contrastive learning. The best-performing model achieves a moderate correlation (0.5~0.6) with human judgments without relying on any labels or transcriptions. Furthermore, both models can easily be extended to leverage textual transcriptions of speech and learn much better speech embeddings that are strongly correlated with human annotations. Our proposed methods are applicable to the development of purely data-driven systems for speech mining, indexing, and search.
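To make the distillation step concrete, the following is a minimal sketch (not the paper's implementation) of the contrastive objective described for S-HuBERT: a frozen teacher that embeds hidden-unit sequences supervises a trainable speech encoder through an in-batch InfoNCE-style loss. All names, dimensions, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(speech_emb, teacher_emb, temperature=0.05):
    """InfoNCE-style loss: pull each speech embedding toward the teacher
    embedding of the same utterance, push it away from the other
    utterances in the batch. Both inputs have shape (B, D)."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    teacher_emb = F.normalize(teacher_emb, dim=-1)
    logits = speech_emb @ teacher_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)

# Usage sketch: `speech_encoder` (e.g. a HuBERT-based model) and
# `unit_sentence_encoder` (a sentence embedding model trained on the
# discovered hidden units) are hypothetical stand-ins for the paper's
# components.
#
# with torch.no_grad():                          # teacher is frozen
#     teacher_emb = unit_sentence_encoder(hidden_units)
# speech_emb = speech_encoder(waveforms)
# loss = contrastive_distillation_loss(speech_emb, teacher_emb)
```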