We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly annotated social media posts from Reddit spanning 24 health conditions. Annotations include demarcations of spans corresponding to medical claims, personal experiences, and questions. We collect additional granular annotations on identified claims: specifically, we mark snippets that describe patient Populations, Interventions, and Outcomes (PIO elements) within them. Using this corpus, we introduce the task of retrieving trustworthy evidence relevant to a given claim made on social media. We propose a new method to automatically derive (noisy) supervision for this task, which we use to train a dense retrieval model; this model outperforms baseline models. Manual evaluation of retrieval results performed by medical doctors indicates that while our system's performance is promising, there is considerable room for improvement. Collected annotations (and scripts to assemble the dataset) are available at https://github.com/sominw/redhot.