In this paper, we approach the problem of semantic search by framing the search task as paraphrase span detection: given a segment of text as a query phrase, the task is to identify its paraphrase within a given document, the same modelling setup typically used in extractive question answering. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, including their original document context, we find that our paraphrase span detection model outperforms two strong retrieval baselines (lexical similarity and BERT sentence embeddings) by 31.9pp and 22.4pp respectively in terms of exact match, and by 22.3pp and 12.9pp in terms of token-level F-score. This demonstrates a strong advantage of modelling the task as span retrieval rather than sentence similarity. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available.
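The abstract reports results in terms of exact match and token-level F-score between a predicted span and the gold paraphrase. As a minimal sketch of how such metrics are conventionally computed in extractive-QA-style evaluation (the function names and whitespace normalisation are illustrative assumptions, not taken from the paper):

```python
from collections import Counter


def exact_match(pred: str, gold: str) -> bool:
    """True if the predicted span equals the gold span after
    simple whitespace normalisation (an illustrative choice)."""
    return " ".join(pred.split()) == " ".join(gold.split())


def token_f1(pred: str, gold: str) -> float:
    """Token-level F-score between predicted and gold spans,
    computed over bag-of-token overlap, as in standard
    extractive-QA evaluation."""
    pred_toks, gold_toks = pred.split(), gold.split()
    if not pred_toks or not gold_toks:
        # Both empty counts as a perfect match; one empty scores 0.
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "a cat sat")` yields 2/3, since two of three tokens overlap in both directions. A "pp" improvement in the abstract then refers to the difference in these scores averaged over the evaluation set, expressed in percentage points.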