Deep Representation Learning in Speech Processing: Challenges, Recent Advances, and Future Trends

Research on speech processing has traditionally treated the design of hand-engineered acoustic features (feature engineering) as a problem distinct from the design of efficient machine learning (ML) models for prediction and classification. This approach has two main drawbacks: first, manual feature engineering is cumbersome and requires human knowledge; second, the designed features might not be the best for the objective at hand. This has motivated a recent trend in the speech community towards representation learning techniques, which automatically learn an intermediate representation of the input signal that better suits the task at hand and hence leads to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the representations are more useful and less dependent on human knowledge, making them well suited to tasks such as classification and prediction. The main contribution of this paper is an up-to-date and comprehensive survey of speech representation learning techniques that brings together research scattered across three distinct areas: Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent reviews in speech have covered ASR, SR, and SER; however, none of them has focused on representation learning from speech, a gap that our survey aims to bridge.

Related content

Representation learning

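Representation learning uses training data to learn vector representations, which overcomes the limitations of hand-crafted methods. It is usually divided into two broad categories: unsupervised and supervised representation learning. Most unsupervised methods use the latent variables of an autoencoder (for example, a denoising autoencoder or a sparse autoencoder) as the representation. The more recent variational autoencoders tolerate noise and outliers better. However, exactly inferring the latent structure of given data is almost impossible, so approximate inference strategies are used instead. In addition, some unsupervised methods aim to approximate a specific similarity measure. One proposed unsupervised, similarity-preserving framework uses matrix factorization to preserve pairwise dynamic time warping (DTW) similarities. Another approach learns DTW-preserving shapelets, so that the Euclidean distance in the transformed space approximates the true DTW distance between the original series. Supervised representation learning methods can exploit label information to better capture the semantic structure of the data. Siamese networks and triplet networks are two currently popular models; their goal is to maximize the distance between classes and minimize the distance within each class.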
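Two of the quantities mentioned above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation of any of the methods referred to: the function names `dtw_distance`, `euclidean`, and `triplet_loss`, the absolute-difference local cost, and the default `margin=1.0` are all assumptions chosen for the example.

```python
import math


def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences.

    Assumption: the local cost is the absolute difference |a[i] - b[j]|.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors.

    The loss is zero once the negative (different class) is at least
    `margin` farther from the anchor than the positive (same class).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

For instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because warping absorbs the repeated sample, whereas a plain Euclidean distance is not even defined for sequences of different lengths. Minimizing `triplet_loss` over many triplets pulls same-class embeddings together and pushes different-class embeddings apart, which is the objective described for triplet networks.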
