We propose to use image captions from the Web as a previously underutilized resource for paraphrases (i.e., texts with the same "message") and to create and analyze a corresponding dataset. When an image is reused on the Web, it is often given a new, original caption. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyze captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The paper introduces the underlying mining technology and compares known paraphrase corpora with our new resource in terms of syntactic and semantic paraphrase similarity. In this context, we introduce characteristic maps along these two similarity dimensions to identify the style of paraphrases coming from different sources. An annotation study demonstrates the high reliability of the algorithmically determined characteristic maps.
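The core hypothesis, that different captions of the same image form a set of mutual paraphrases, can be illustrated with a minimal sketch. This is not the paper's mining technology: the `(image_id, caption)` record format, the function names, and the example captions are all hypothetical, and the token-overlap score is only a crude stand-in for a syntactic similarity measure. A semantic similarity measure (for example, one based on sentence embeddings) would be needed for the second dimension of a characteristic map and is not shown here.

```python
from collections import defaultdict
from itertools import combinations


def group_captions(records):
    """Collect the distinct captions attached to each image identifier."""
    by_image = defaultdict(set)
    for image_id, caption in records:
        text = " ".join(caption.split())  # normalize whitespace
        if text:
            by_image[image_id].add(text)
    return by_image


def paraphrase_pairs(by_image):
    """Yield every unordered pair of distinct captions that share an image."""
    for image_id, captions in by_image.items():
        for a, b in combinations(sorted(captions), 2):
            yield image_id, a, b


def token_jaccard(a, b):
    """Word-overlap score in [0, 1]; an assumed proxy for syntactic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


# Hypothetical input: one image reused with different captions, one used once.
records = [
    ("File:Eiffel.jpg", "The Eiffel Tower at night"),
    ("File:Eiffel.jpg", "Eiffel Tower illuminated after dark"),
    ("File:Eiffel.jpg", "The Eiffel Tower at night"),  # verbatim reuse, dropped
    ("File:Cat.jpg", "A cat"),  # single caption, yields no pair
]

pairs = list(paraphrase_pairs(group_captions(records)))
for image_id, a, b in pairs:
    print(image_id, round(token_jaccard(a, b), 2), a, "|", b)
```

Images with only one distinct caption contribute no pairs, and verbatim repeats are collapsed, so only genuinely relabeled images produce paraphrase candidates. Plotting each pair by a syntactic score and a semantic score would give a rough picture of what a two-dimensional characteristic map might look like.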