As a neurophysiological response to threat or adverse conditions, stress can affect cognition, emotion and behaviour, with potentially detrimental effects on health under sustained exposure. Since the affective content of speech is inherently modulated by an individual's physical and mental state, a substantial body of research has been devoted to the study of paralinguistic correlates of stress-inducing task load. Historically, voice stress analysis (VSA) has been conducted using conventional digital signal processing (DSP) techniques. Despite the development of modern methods based on deep neural networks (DNNs), accurately detecting stress in speech remains difficult due to the wide variety of stressors and considerable variability in individual stress perception. To address this, we introduce a set of five datasets for task load detection in speech. The voice recordings were collected while either cognitive or physical stress was induced in a cohort of volunteers, totalling more than a hundred speakers. We used the datasets to design and evaluate a novel self-supervised audio representation that leverages the effectiveness of handcrafted (DSP-based) features and the complexity of data-driven DNN representations. Notably, the proposed approach outperformed both extensive handcrafted feature sets and novel DNN-based audio representation learning approaches.