Depression detection from user-generated content on the internet has been a long-standing topic of interest in the research community, offering valuable screening tools for psychologists. The ubiquitous use of social media platforms provides a natural avenue for exploring mental health manifestations in posts and interactions with other users. Current methods for depression detection from social media mainly focus on text processing, and only a few also utilize the images posted by users. In this work, we propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts, using pretrained models to extract image and text embeddings. Our model operates directly at the user level, and we enrich it with the relative time between posts by using time2vec positional embeddings. Moreover, we propose another model variant that can operate on randomly sampled and unordered sets of posts, making it more robust to dataset noise. We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets, obtaining state-of-the-art results of 0.931 F1 score on a popular multimodal Twitter dataset and 0.902 F1 score on the only multimodal Reddit dataset.
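For context, time2vec (Kazemi et al., 2019) represents a scalar time value $\tau$ (here, the relative time between a user's posts) as a learnable vector combining one linear component and $k$ periodic components; a common formulation, with $\sin$ as the periodic activation, is

$$
\mathrm{t2v}(\tau)[i] =
\begin{cases}
\omega_i \tau + \varphi_i, & i = 0,\\
\sin(\omega_i \tau + \varphi_i), & 1 \le i \le k,
\end{cases}
$$

where $\omega_i$ and $\varphi_i$ are learnable frequency and phase parameters.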