Approaches for Enriching and Improving Textual Knowledge Bases

Verifiability is one of Wikipedia's core editing principles: editors are encouraged to provide citations for the statements they add. A statement can be any piece of text, from a single sentence up to a paragraph. In many cases, however, citations are outdated, missing, or point to references that no longer exist (e.g., dead URLs or moved content). In 20% of cases, such citations refer to news articles, which are the second most cited source. Even where citations are provided, nothing explicitly indicates the span of text that a citation covers. Beyond issues with the verifiability principle, many Wikipedia entity pages are incomplete and lack relevant information that is already available in online news sources. Even for existing citations, there is often a delay between the time a news article is published and the time it is referenced. In this thesis, we address these issues and propose automated approaches that enforce the verifiability principle in Wikipedia and suggest relevant, missing news references to further enrich Wikipedia entity pages.
