Social media contains unfiltered and unique information that is potentially of great value but, in the case of misinformation, can also do great harm. With regard to biomedical topics, false information can be particularly dangerous. Methods for automatic fact-checking and fake news detection address this problem, but they have not yet been applied to the biomedical domain in social media. We aim to fill this research gap and annotate a corpus of 1200 tweets for implicit and explicit biomedical claims (the latter also with span annotations for the claim phrase). With this corpus, which we sample to be related to COVID-19, measles, cystic fibrosis, and depression, we develop baseline models that automatically detect tweets containing a claim. Our analyses reveal that biomedical tweets are densely populated with claims (45% in a corpus of 1200 tweets sampled to focus on the domains mentioned above). Baseline classification experiments with embedding-based classifiers and BERT-based transfer learning demonstrate that detection is challenging but achieves acceptable performance for the identification of explicit expressions of claims. Implicit claim tweets are more challenging to detect.