One of the first steps in many text-based social science studies is to retrieve the documents relevant to the analysis from large corpora of otherwise irrelevant documents. The conventional social science approach to this retrieval task is to apply a list of keywords and to consider a document relevant if it contains at least one of the keywords. Incomplete keyword lists, however, risk producing biased inferences. More complex and costly methods, such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning, have the potential to separate relevant from irrelevant documents more accurately and thereby to reduce the potential size of this bias. Yet whether these more expensive approaches improve retrieval performance over keyword lists at all, and if so by how much, is unclear, as a systematic comparison of these approaches has been lacking. This study closes this gap by comparing these methods across three retrieval tasks based on a data set of German tweets (Linder, 2017), the Social Bias Inference Corpus (SBIC) (Sap et al., 2020), and the Reuters-21578 corpus (Lewis, 1997). The results show that query expansion techniques and topic model-based classification rules tend to decrease rather than increase retrieval performance in most of the studied settings. Active supervised learning, however, if applied to a not too small set of labeled training instances (e.g., 1,000 documents), reaches substantially higher retrieval performance than keyword lists.
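To make the conventional approach concrete, the following minimal Python sketch implements the keyword-list rule described above: a document counts as relevant if it matches at least one keyword. The keyword list and documents are hypothetical examples for illustration, not materials from the study.

```python
import re

# A minimal sketch of the conventional keyword-list retrieval rule:
# a document is considered relevant if it contains at least one keyword.
# KEYWORDS and docs below are invented for illustration only.

KEYWORDS = ["refugee", "asylum", "migration"]

def keyword_retrieve(documents, keywords):
    """Return the documents matching at least one keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]

docs = [
    "New asylum statistics were released today.",
    "Local football results from the weekend.",
]
print(keyword_retrieve(docs, KEYWORDS))
# -> ['New asylum statistics were released today.']
```

As the abstract notes, the weakness of this rule lies not in its mechanics but in the keyword list itself: any relevant document that happens to use none of the listed terms is silently excluded, which is the source of the potential bias.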