Depression is the most common psychological disorder and a leading cause of disability and suicide worldwide. An automated system capable of detecting signs of depression in human speech could help ensure timely and effective mental health care for individuals suffering from the disorder. Developing such a system requires accurate machine learning models capable of capturing signs of depression. However, state-of-the-art models based on deep acoustic representations require abundant data, meticulous feature selection, and rigorous training, a procedure that consumes enormous computational resources. In this work, we explore the effectiveness of two different acoustic feature groups, conventional hand-curated features and deep representation features, for predicting the severity of depression from speech. We examine possible contributing factors to model performance, including the gender of the speaker, the severity of the disorder, and the content and length of speech. Our findings suggest that models trained on conventional acoustic features perform as well as or better than those trained on deep representation features, at a significantly lower computational cost and irrespective of other factors such as the content and length of speech, the gender of the speaker, and the severity of the disorder. This makes such models a better fit for deployment where computational resources are limited, such as real-time depression monitoring applications on smart devices.
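To make the contrast concrete, the sketch below computes two classic hand-curated acoustic features, zero-crossing rate and RMS energy, over sliding frames of a waveform. These two features and all frame parameters are illustrative choices, not the paper's actual feature set, which is not specified here; the point is only that such features are cheap to compute compared to deep representations.

```python
import math

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    # Root-mean-square amplitude of the frame.
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def frame_features(signal, frame_len=400, hop=160):
    # Slide a window over the signal (25 ms frames with a 10 ms hop
    # at 16 kHz, a common but here purely illustrative choice) and
    # emit one (ZCR, RMS) feature vector per frame.
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append((zero_crossing_rate(frame), rms_energy(frame)))
    return feats

# Synthetic 440 Hz tone sampled at 16 kHz, standing in for speech.
sr = 16000
signal = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
features = frame_features(signal)
```

Such per-frame descriptors are typically pooled over an utterance (means, variances, percentiles) before being fed to a lightweight regressor, which is why this family of models runs comfortably on resource-constrained devices.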