Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution

It is tempting to treat frequency trends from the Google Books data sets as indicators of the "true" popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations that make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one copy of each book. A single prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether or not the author is widely read. With this understood, the Google Books corpus remains an important data set, to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical of academic articles but less common in general, such as references to time in the form of citations. We highlight these dynamics by examining and comparing major contributions to the statistical divergence of English data sets between decades in the period 1800 to 2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered English data sets. Our findings emphasize the need to fully characterize the dynamics of the Google Books corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.
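
The idea of examining "major contributions to the statistical divergence" between decades can be illustrated with a minimal sketch. The abstract does not name the divergence measure, so the example below assumes the Jensen-Shannon divergence (in bits), which splits naturally into one non-negative term per word. The function names and the toy word counts are hypothetical and invented for illustration; they are not data from the corpus.

```python
import math


def normalize(counts):
    """Turn raw word counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jsd_contributions(counts_a, counts_b):
    """Per-word contributions to the Jensen-Shannon divergence (in bits).

    Assumption: the abstract only says "statistical divergence"; JSD is
    used here as one plausible choice, not as the confirmed method.
    """
    p = normalize(counts_a)
    q = normalize(counts_b)
    contributions = {}
    for word in set(p) | set(q):
        pi = p.get(word, 0.0)
        qi = q.get(word, 0.0)
        mi = 0.5 * (pi + qi)  # frequency in the equal-weight mixture
        term = 0.0
        if pi > 0:
            term += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            term += 0.5 * qi * math.log2(qi / mi)
        contributions[word] = term
    return contributions


# Hypothetical toy counts standing in for two decades of a data set.
decade_early = {"the": 50, "love": 30, "heart": 20}
decade_late = {"the": 50, "love": 10, "1985": 25, "et": 15}

contrib = jsd_contributions(decade_early, decade_late)
total_divergence = sum(contrib.values())

# Rank words by how much of the divergence they account for.
for word, value in sorted(contrib.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>6}  {value:.4f}  ({value / total_divergence:.1%})")
```

In this toy case the citation-like tokens ("1985", "et") account for a large share of the divergence, while "the", whose relative frequency is unchanged, contributes nothing. Ranking words this way is what lets a reader see whether a shift between decades is driven by general vocabulary or by phrases typical of academic writing.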
