In this paper, we introduce the first fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headings. Of all paraphrase pairs in the corpus, 98% are manually classified as paraphrases at least in their given context, if not in all contexts. Additionally, we establish a manual candidate selection method and demonstrate its feasibility for high-quality paraphrase selection in terms of both cost and quality.