In recent years, there has been a surge of interest in automatic mental health detection (MHD) from social media data, leveraging advances in natural language processing and machine learning. While significant progress has been achieved in this interdisciplinary research area, the vast majority of work has treated MHD as a binary classification task. A multiclass classification setup is, however, essential if we are to uncover the subtle differences among the statistical patterns of language use associated with particular mental health conditions. Here, we report on experiments aimed at predicting six conditions (anxiety, attention deficit hyperactivity disorder, bipolar disorder, post-traumatic stress disorder, depression, and psychological stress) from Reddit social media posts. We explore and compare the performance of hybrid and ensemble models that combine transformer-based architectures (BERT and RoBERTa) with BiLSTM neural networks trained on within-text distributions of a diverse set of linguistic features. This set encompasses measures of syntactic complexity, lexical sophistication and diversity, readability, and register-specific n-gram frequencies, as well as sentiment and emotion lexicons. In addition, we conduct feature ablation experiments to investigate which types of features are most indicative of particular mental health conditions.
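To illustrate the kind of hybrid architecture described above, the following is a minimal sketch (not the authors' implementation) of a classifier that fuses a transformer [CLS] representation with a BiLSTM run over per-segment linguistic feature vectors; the encoder name, feature dimensionality, and concatenation-based fusion are assumptions made for the example.

```python
# Hypothetical sketch of a transformer + BiLSTM hybrid for 6-way MHD classification.
# Assumes per-segment linguistic feature vectors (e.g., one per sentence) are
# precomputed; model names and dimensions are illustrative, not the paper's settings.
import torch
import torch.nn as nn
from transformers import AutoModel

class HybridMHDClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased",
                 n_ling_feats=64, lstm_hidden=128, n_classes=6):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # BERT or RoBERTa
        self.bilstm = nn.LSTM(input_size=n_ling_feats,
                              hidden_size=lstm_hidden,
                              batch_first=True, bidirectional=True)
        fused_dim = self.encoder.config.hidden_size + 2 * lstm_hidden
        self.head = nn.Sequential(nn.Dropout(0.2),
                                  nn.Linear(fused_dim, n_classes))

    def forward(self, input_ids, attention_mask, ling_feats):
        # ling_feats: (batch, n_segments, n_ling_feats) -- within-text distribution
        # of linguistic features over segments of a post.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        _, (h_n, _) = self.bilstm(ling_feats)
        # Concatenate final forward and backward hidden states of the BiLSTM.
        lstm_repr = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.head(torch.cat([cls, lstm_repr], dim=-1))
```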