We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations.
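The bootstrapping loop described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the `$Q`/`$S` placeholder syntax, the capitalized-name heuristic for speakers, the whole-sentence matching, the four-sentence corpus, and all function names are hypothetical. It also omits any pattern scoring or filtering that a real system would need to keep precision high.

```python
import re

# Hypothetical toy corpus: the same quotation appears in different contexts.
CORPUS = [
    '"We will win", said Jane Doe.',
    'Jane Doe told reporters: "We will win"',
    'John Smith told reporters: "Prices are falling"',
    '"Prices are falling", according to John Smith.',
]

# Assumed sub-patterns: a quotation has no inner double quotes,
# and a speaker is two or more capitalized words.
QUOTE_RE = r'(?P<Q>[^"]+)'
SPEAKER_RE = r'(?P<S>[A-Z][a-z]+(?: [A-Z][a-z]+)+)'


def pattern_to_regex(pattern):
    """Turn a template such as '"$Q", said $S.' into a compiled regex."""
    escaped = re.escape(pattern)
    escaped = escaped.replace(re.escape("$Q"), QUOTE_RE)
    escaped = escaped.replace(re.escape("$S"), SPEAKER_RE)
    return re.compile(escaped)


def extract_pairs(sentences, patterns):
    """Apply every pattern to every sentence and collect (Q, S) pairs."""
    pairs = set()
    for sentence in sentences:
        for pattern in patterns:
            match = pattern_to_regex(pattern).fullmatch(sentence)
            if match:
                pairs.add((match.group("Q"), match.group("S")))
    return pairs


def discover_patterns(sentences, pairs):
    """Derive new templates from sentences that contain a known pair."""
    patterns = set()
    for sentence in sentences:
        for quote, speaker in pairs:
            if f'"{quote}"' in sentence and speaker in sentence:
                candidate = sentence.replace(quote, "$Q").replace(speaker, "$S")
                # Keep only templates with exactly one slot of each kind.
                if candidate.count("$Q") == 1 and candidate.count("$S") == 1:
                    patterns.add(candidate)
    return patterns


def bootstrap(sentences, seed_patterns, max_iterations=5):
    """Alternate extraction and pattern discovery until nothing changes."""
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(max_iterations):
        new_pairs = extract_pairs(sentences, patterns)
        new_patterns = patterns | discover_patterns(sentences, new_pairs)
        if new_pairs == pairs and new_patterns == patterns:
            break  # fixed point reached
        pairs, patterns = new_pairs, new_patterns
    return pairs, patterns


if __name__ == "__main__":
    found_pairs, found_patterns = bootstrap(CORPUS, ['"$Q", said $S.'])
    for quote, speaker in sorted(found_pairs):
        print(f"{speaker}: {quote}")
    for pattern in sorted(found_patterns):
        print(pattern)
```

On this toy corpus, the single seed pattern only matches the first sentence directly. The second quotation is reached because the first pair exposes a new pattern in the second sentence, which then matches the third. This is the redundancy effect the abstract relies on.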