Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations, where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset covers six mental health conditions, including Anxiety, Autism, and Depression, as well as a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset's potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.