This paper proposes an end-to-end deep network for recognizing different accents within the same language. We adapt deep architectures from speaker recognition to the accent classification task in order to learn utterance-level accent representations. Whereas speaker recognition relies on individual-level features, accent recognition poses the more challenging problem of learning compact group-level features shared by all speakers with the same accent, so a highly discriminative accent feature space is desired. Our framework adopts a multitask learning mechanism and takes speech spectrograms as input. It consists of three modules: a shared front-end encoder based on CNNs and RNNs, a core accent recognition branch, and an auxiliary speech recognition branch. More specifically, given the sequential descriptors learned by the shared encoder, the accent recognition branch first condenses all descriptors into an embedding vector and then explores different discriminative loss functions popular in the face recognition domain to enhance the discriminability of the embedding. Additionally, because accent is a speaking-related timbre, adding the speech recognition branch effectively curbs over-fitting in accent recognition during training. We show that our network, without any data-augmentation preprocessing, significantly outperforms the baseline system on the accent classification track of the Accented English Speech Recognition Challenge 2020 (AESRC2020), where the state-of-the-art Circle-Loss function achieves the best discriminative optimization of the accent representation.
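To make the accent recognition branch more concrete, the following is a minimal sketch of its two steps: condensing the encoder's sequential descriptors into one embedding, then scoring that embedding with a class-level Circle Loss. Several details are assumptions made for illustration and are not stated in the abstract: mean pooling as the condensing step, cosine similarity against one weight vector per accent class, and the `margin` and `gamma` values. The function names are hypothetical.

```python
import math


def mean_pool(descriptors):
    """Condense T frame-level descriptors (each of size D) into one D-dim vector.

    Mean pooling is an assumption; the abstract only says the descriptors
    are condensed into an embedding vector.
    """
    num_frames = len(descriptors)
    dim = len(descriptors[0])
    return [sum(frame[d] for frame in descriptors) / num_frames for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def circle_loss(embedding, class_weights, label, margin=0.25, gamma=64.0):
    """Class-level Circle Loss for one utterance embedding.

    s_p is the similarity to the true accent class, s_n the similarities to
    all other classes. margin and gamma are illustrative values (assumption).
    """
    sims = [cosine(embedding, w) for w in class_weights]

    # Positive term: weight alpha_p grows the further s_p is from its optimum 1 + margin.
    s_p = sims[label]
    alpha_p = max(1.0 + margin - s_p, 0.0)
    logit_p = -gamma * alpha_p * (s_p - (1.0 - margin))

    # Negative terms: weight alpha_n grows the further s_n is above its optimum -margin.
    logits_n = []
    for j, s_n in enumerate(sims):
        if j == label:
            continue
        alpha_n = max(s_n + margin, 0.0)
        logits_n.append(gamma * alpha_n * (s_n - margin))

    # log(1 + sum_j exp(logit_n_j) * exp(logit_p))
    return softplus(logsumexp(logits_n) + logit_p)


# Toy usage: two 2-dim descriptors and three hypothetical accent classes.
descriptors = [[1.0, 0.0], [3.0, 0.0]]
class_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
embedding = mean_pool(descriptors)
loss_correct = circle_loss(embedding, class_weights, label=0)
loss_wrong = circle_loss(embedding, class_weights, label=2)
```

In this toy setup the embedding points along the weight vector of class 0, so the loss for `label=0` is much smaller than for `label=2`. In a multitask setting this accent loss would be combined with the speech recognition branch's loss, but the abstract does not say how the two are weighted.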