Research on speech processing has traditionally treated the design of hand-engineered acoustic features (feature engineering) as a problem distinct from the design of efficient machine learning (ML) models that make prediction and classification decisions. This approach has two main drawbacks: first, manual feature engineering is cumbersome and requires human knowledge; second, the designed features might not be the best for the objective at hand. This has motivated a recent trend in the speech community towards the utilisation of representation learning techniques, which automatically learn an intermediate representation of the input signal that better suits the task at hand and hence leads to improved performance. The significance of representation learning has increased with advances in deep learning (DL), where the learned representations are more useful and less dependent on human knowledge, making them well suited to tasks such as classification and prediction. The main contribution of this paper is an up-to-date and comprehensive survey of speech representation learning techniques that brings together research scattered across three distinct areas: Automatic Speech Recognition (ASR), Speaker Recognition (SR), and Speaker Emotion Recognition (SER). Recent reviews in speech have been conducted for ASR, SR, and SER; however, none of these has focused on representation learning from speech, a gap that our survey aims to bridge.