Research shows that exposure to suicide-related news media content is associated with suicide rates, with some content characteristics likely having harmful effects and others potentially protective ones. Although good evidence exists for a few selected characteristics, systematic large-scale investigations are lacking in general, and for social media data in particular. We apply machine learning methods to automatically label large quantities of Twitter data. We developed a novel annotation scheme that classifies suicide-related tweets into different message types and into problem- vs. solution-focused perspectives. We then trained and compared a set of machine learning models as a benchmark, including a majority classifier, an approach based on word frequency (TF-IDF with a linear SVM), and two state-of-the-art deep learning models (BERT, XLNet). The two deep learning models achieved the best performance in two classification tasks. First, we classified six main content categories: personal stories about either suicidal ideation and attempts or coping, calls for action intended to spread either problem awareness or prevention-related information, reports of suicide cases, and other suicide-related and off-topic tweets. The deep learning models reached accuracy scores above 73% on average across the six categories, and F1-scores between 69% and 85% for all categories except suicidal ideation and attempts (55%). Second, in separating posts referring to actual suicide from off-topic tweets, they correctly labeled around 88% of tweets, with BERT achieving F1-scores of 93% and 74% for the two categories. These classification performances are comparable to the state of the art on similar tasks. By making data labeling more efficient, this work enables future large-scale investigations of the harmful and protective effects of various kinds of social media content on suicide rates and on help-seeking behavior.
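To make the reported metrics concrete, the following is a minimal sketch of how a majority classifier baseline, accuracy, and per-class F1-scores can be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names and the toy labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def majority_baseline(train_labels, n_predictions):
    """Predict the most frequent training label for every test item."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n_predictions


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never occurs gets 0.0 here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical toy labels for the binary task (actual suicide vs. off-topic).
train = ["suicide", "suicide", "suicide", "off-topic"]
y_true = ["suicide", "suicide", "off-topic", "suicide"]

y_pred = majority_baseline(train, len(y_true))
acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)

print(f"accuracy: {acc:.2f}")
for label, score in f1.items():
    print(f"F1[{label}]: {score:.2f}")
```

On this toy data the baseline scores reasonably on accuracy but gets an F1 of zero for the minority class, which is why per-class F1-scores are informative alongside accuracy when categories are imbalanced.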