Vocal fry, or creaky voice, is a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it marks not only phrase finality but also sociolinguistic factors and affect. Because of its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages in which creak is frequent. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model consists of an encoder and a classifier that are trained jointly. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is a multi-headed fully connected network trained to detect creaky voice, voicing, and pitch; the voicing and pitch outputs are used to refine the creak prediction. The model is trained and tested on speech from American English speakers, annotated for creak by trained phoneticians. We evaluate the system with two encoders: one tailored to the task and one based on a state-of-the-art unsupervised representation. Results suggest that our best-performing system improves recall and F1 scores over previous methods on unseen data.
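To make the evaluation terms concrete, the following is a minimal sketch of how frame-level creak decisions might be refined and then scored with recall and F1. The abstract does not state how voicing and pitch refine the creak prediction, so the gating rule in `refine_creak` (keep a creak decision only on frames also predicted as voiced) is purely an assumption for illustration, as are the function names, the 0.5 threshold, and the toy values.

```python
from typing import List, Tuple


def refine_creak(creak_prob: List[float], voiced_prob: List[float],
                 threshold: float = 0.5) -> List[int]:
    """Hypothetical refinement rule (an assumption, not the paper's method):
    a frame is labeled creaky only if both the creak and the voicing
    probabilities reach the threshold."""
    return [int(c >= threshold and v >= threshold)
            for c, v in zip(creak_prob, voiced_prob)]


def precision_recall_f1(pred: List[int], gold: List[int]) -> Tuple[float, float, float]:
    """Frame-level precision, recall, and F1 for binary labels (1 = creak)."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy, made-up frame values for illustration only (not data from the paper).
creak_prob = [0.9, 0.8, 0.7, 0.2, 0.6]
voiced_prob = [0.9, 0.9, 0.1, 0.8, 0.7]
gold = [1, 1, 0, 0, 0]  # hypothetical annotator labels per frame

pred = refine_creak(creak_prob, voiced_prob)
precision, recall, f1 = precision_recall_f1(pred, gold)
print(pred, round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case the third frame is dropped because it is predicted unvoiced, while the last frame remains a false positive, which lowers precision but leaves recall unaffected.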