Behavioral healthcare risk assessment remains a challenging problem due to the highly multimodal nature of patient data and the temporal dynamics of mood and affective disorders. While large language models (LLMs) have demonstrated strong reasoning capabilities, their effectiveness in structured clinical risk scoring remains unclear. In this work, we introduce HARBOR, a behavioral-health-aware language model designed to predict a discrete mood and risk score, termed the Harbor Risk Score (HRS), on an integer scale from -3 (severe depression) to +3 (mania). We also release PEARL, a longitudinal behavioral healthcare dataset spanning four years of monthly observations from three patients, containing physiological, behavioral, and self-reported mental health signals. We benchmark traditional machine learning models, proprietary LLMs, and HARBOR across multiple evaluation settings and ablations. Our results show that HARBOR outperforms classical baselines and off-the-shelf LLMs, achieving 69 percent accuracy compared to 54 percent for logistic regression and 29 percent for the strongest proprietary LLM baseline.