Towards Corpus-Scale Discovery of Selection Biases in News Coverage: Comparing What Sources Say About Entities as a Start

News sources go through a process of selecting newsworthy information when covering a topic. Because sources differ in their agendas, this process inevitably exhibits selection biases, i.e. each source's typical patterns of choosing what information to include in its coverage. To understand the magnitude and implications of selection biases, one must first discover (1) on which topics sources typically have diverging definitions of "newsworthy" information, and (2) whether the content selection patterns correlate with certain attributes of the news sources, e.g. ideological leaning. The goal of this paper is to investigate and discuss the challenges of building scalable NLP systems that discover patterns of media selection bias directly from news content in massive-scale news corpora, without relying on labeled data. To facilitate research in this domain, we propose and study a conceptual framework in which we compare how sources typically mention certain controversial entities, and use these mention patterns as indicators of the sources' content selection preferences. We empirically show the capabilities of the framework through a case study on NELA-2020, a corpus of 1.8M English news articles from 519 news sources worldwide. We demonstrate an unsupervised representation learning method that captures the selection preferences in how sources typically mention controversial entities. Our experiments show that the distributional divergence of such representations, when studied collectively across entities and news sources, serves as a good indicator of an individual source's ideological leaning. We hope our findings will provide insights for future research on media selection biases.
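To make the idea of "distributional divergence" between sources concrete, the following is a minimal sketch and not the paper's method. It assumes each mention of an entity has already been assigned to a discrete category (the category names, source names, and toy data are all hypothetical), and it substitutes Jensen-Shannon divergence over those categories for the learned representations described in the abstract.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Turn a list of category labels into an add-one smoothed probability vector."""
    counts = Counter(labels)
    total = len(labels) + len(categories)
    return [(counts[c] + 1) / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical toy data: the context in which each source mentions one entity.
categories = ["policy", "scandal", "election"]
mentions = {
    "source_a": ["policy", "policy", "election", "policy"],
    "source_b": ["scandal", "scandal", "election", "scandal"],
    "source_c": ["policy", "election", "election", "policy"],
}

dists = {src: distribution(m, categories) for src, m in mentions.items()}
d_ab = js_divergence(dists["source_a"], dists["source_b"])
d_ac = js_divergence(dists["source_a"], dists["source_c"])

print(f"JS(source_a, source_b) = {d_ab:.3f}")
print(f"JS(source_a, source_c) = {d_ac:.3f}")
```

In this toy setup, source_a diverges more from source_b than from source_c, because source_b mostly mentions the entity in a different context. Aggregating such pairwise divergences across many entities is one plausible way to compare sources collectively, under the assumptions stated above.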
