Networks of documents connected by hyperlinks, such as Wikipedia, are ubiquitous. Authors insert hyperlinks to enrich the text and to facilitate navigation through the network. However, they tend to insert only a fraction of the relevant hyperlinks, mainly because doing so is time-consuming. In this paper we address an annotation task that we refer to as anchor prediction. Although it is conceptually close to link prediction and entity linking, it is a distinct task that requires a dedicated method. Given a source document and a target document, the task consists in automatically identifying anchors in the source document, i.e., words or terms that should carry a hyperlink pointing to the target document. We propose a contextualized relational topic model, CRTM, that models directed links between documents as a function of the local context of the anchor in the source document and the whole content of the target document. The model can be used to predict anchors in a source document, given the target document, without relying on a dictionary of previously seen mentions or titles, or on any external knowledge graph. Authors can benefit from CRTM by letting it automatically suggest hyperlinks, given a new document and the set of target documents to connect to. Readers can also benefit, since hyperlinks can be inserted dynamically between the documents they are reading. Experiments conducted on several Wikipedia corpora (in English, Italian and German) highlight the practical usefulness of anchor prediction and demonstrate the relevance of our approach.
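To make the task interface concrete, the following minimal sketch ranks every token of a source document as a candidate anchor for a given target document. It is not CRTM: it replaces the topic model with a deliberately naive bag-of-words cosine similarity between the local context window of each token and the whole target document. The function names, the `window` parameter and the tokenizer are all assumptions introduced for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an illustrative assumption, not the paper's)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_anchor_candidates(source, target, window=3):
    """Score each source token by how similar its local context is to the target.

    Returns (score, position, token) tuples, best score first.
    This is a naive stand-in for a learned model such as CRTM.
    """
    src_tokens = tokenize(source)
    target_bow = Counter(tokenize(target))  # whole content of the target
    scored = []
    for i, token in enumerate(src_tokens):
        # Local context: the token plus up to `window` tokens on each side.
        context = Counter(src_tokens[max(0, i - window): i + window + 1])
        scored.append((cosine(context, target_bow), i, token))
    # Sort by descending score, breaking ties by position in the source.
    return sorted(scored, key=lambda s: (-s[0], s[1]))
```

Calling `rank_anchor_candidates(source_text, target_text)` returns one scored entry per source token. Such a baseline has no notion of anchor boundaries or link direction, which is part of what a dedicated model is meant to capture.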