Wikipedia has become an immensely popular crowd-sourced encyclopedia that disseminates information on a wide range of topics as subscription-free content. Anyone may contribute, which keeps the articles comprehensive and up to date. To enrich content without compromising standards, the Wikipedia community maintains a detailed set of guidelines that contributors should follow. Based on these guidelines, Wikipedia editors categorize articles into several quality classes, with each higher class reflecting closer adherence to the guidelines. This quality assessment task is laborious for editors and demands platform expertise. As a first objective, in this paper, we study the evolution of a Wikipedia article with respect to these quality scales. Our results reveal novel, non-intuitive patterns emerging from this exploration. As a second objective, we attempt to develop an automated, data-driven approach for detecting the early signals that influence the quality change of articles. We pose this as a change point detection problem in which we represent an article as a time series of consecutive revisions and encode every revision by a set of intuitive features. Finally, we use various change point detection algorithms to detect future change points efficiently and accurately. We also perform various ablation studies to understand which groups of features are most effective in identifying the change points. To the best of our knowledge, this is the first work that rigorously explores the quality life cycle of English Wikipedia articles from the perspective of quality indicators and provides a novel, unsupervised, page-level approach to detecting quality switches, which can help in automatic content monitoring on Wikipedia, thereby contributing to the CSCW community.
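To make the change point formulation concrete, the following is a minimal sketch and not the method used in the paper. It assumes each revision is encoded as a numeric feature vector (the two features used in the usage example, content length and reference count, are hypothetical), and it swaps in a simple offline least-squares search for a single change point. The function names and the `min_gain` threshold are illustrative assumptions.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each feature column so no single feature dominates the cost."""
    scaled_cols = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        scaled_cols.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]


def segment_cost(rows):
    """Sum of squared deviations from the segment mean, over all features."""
    total = 0.0
    for col in zip(*rows):
        mu = mean(col)
        total += sum((v - mu) ** 2 for v in col)
    return total


def detect_single_change_point(rows, min_size=2, min_gain=0.5):
    """Return the index of the first revision after the change, or None.

    A split is accepted only if it removes at least `min_gain` (a fraction,
    chosen arbitrarily for this sketch) of the unsplit cost.
    """
    scaled = zscore_columns(rows)
    n = len(scaled)
    best_k = None
    best_cost = segment_cost(scaled) * (1.0 - min_gain)
    for k in range(min_size, n - min_size + 1):
        cost = segment_cost(scaled[:k]) + segment_cost(scaled[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k
```

For example, given six revisions encoded as `[content_length, reference_count]` pairs such as `[[100, 1], [110, 1], [105, 2], [400, 10], [420, 11], [410, 12]]`, the function splits the series where both features jump. Standardizing the features first is a design choice of this sketch so that features on large scales do not dominate the cost; detecting several change points, or detecting them online as revisions arrive, would require a different algorithm.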