In the past decade, many news outlets that report scientific breakthroughs and discoveries have emerged, bringing science and technology closer to the general public. However, not all scientific news articles cite proper sources, such as the original scientific papers. A portion of scientific news articles contain misinterpreted, exaggerated, or distorted information that deviates from the facts asserted in the original papers. Manually identifying proper citations is laborious and costly. Therefore, it is necessary to automatically search for pertinent scientific papers that could be used as evidence for a given piece of scientific news. We propose a system called SciEv that searches for scientific evidence papers given a scientific news article. The system employs a two-stage query paradigm, with the first stage retrieving candidate papers and the second stage reranking them. The key feature of SciEv is that it uses domain knowledge entities (DKEs) to find candidates in the first stage, which proved more effective than regular keyphrases. In the reranking stage, we explore different document representations for news articles and candidate papers. To evaluate our system, we compiled a pilot dataset consisting of 100 manually curated (news, paper) pairs from ScienceAlert and similar websites. To the best of our knowledge, this is the first dataset of its kind. Our experiments indicate that the transformer model performs best for DKE extraction. The system achieves P@1 = 50%, P@5 = 71%, and P@10 = 74% when it uses a TF-IDF-based text representation. The transformer-based reranker achieves comparable performance but takes twice as long. We will collect more data and test the system for user experience.
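The two-stage paradigm described above can be illustrated with a minimal sketch. It is not the SciEv implementation: the entity sets are supplied by hand rather than extracted by a transformer model, stage 1 is reduced to a simple entity-overlap filter, and the function names, sample papers, and smoothed IDF weighting are all assumptions made for illustration. P@k is read here as the fraction of news articles whose ground-truth paper appears in the top k results, which is our interpretation given that each news article is paired with one paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_candidates(news_entities, paper_entities):
    """Stage 1 (simplified): keep papers sharing at least one entity with the news."""
    return [pid for pid, ents in paper_entities.items() if news_entities & ents]


def rerank(news_text, candidates, paper_texts):
    """Stage 2: order candidates by TF-IDF cosine similarity to the news text."""
    vecs = tfidf_vectors([news_text] + [paper_texts[pid] for pid in candidates])
    query, rest = vecs[0], vecs[1:]
    scored = sorted(
        zip(candidates, rest), key=lambda pair: cosine(query, pair[1]), reverse=True
    )
    return [pid for pid, _ in scored]


def hit_rate_at_k(rankings, gold, k):
    """Fraction of queries whose gold paper is among the top-k ranked papers."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)


# Hypothetical toy data; entity sets stand in for extracted DKEs.
paper_texts = {
    "p1": "CRISPR gene editing corrects a mutation in human embryos",
    "p2": "Deep learning improves protein structure prediction accuracy",
    "p3": "Gene expression changes in plants under drought stress",
    "p4": "CRISPR screening identifies drug resistance genes in cancer cells",
}
paper_entities = {
    "p1": {"crispr", "gene editing"},
    "p2": {"deep learning", "protein structure"},
    "p3": {"gene expression", "drought stress"},
    "p4": {"crispr", "cancer"},
}
news_text = "Scientists use CRISPR gene editing to fix a mutation in embryos"
news_entities = {"crispr", "gene editing"}

candidates = retrieve_candidates(news_entities, paper_entities)
ranking = rerank(news_text, candidates, paper_texts)
print(ranking)
print(hit_rate_at_k({"n1": ranking}, {"n1": "p1"}, k=1))
```

Filtering on entities first keeps the reranker's input small, so a costlier representation could be swapped into `rerank` without rescoring the whole paper collection.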