WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs

As free online encyclopedias with massive volumes of content, Wikipedia and Wikidata are key to many Natural Language Processing (NLP) tasks, such as information retrieval, knowledge base building, machine translation, text classification, and text summarization. In this paper, we introduce WikiDes, a novel text summarization dataset for generating short descriptions of Wikipedia articles. The dataset consists of over 80k English samples on 6987 topics. We set up a two-phase summarization method, consisting of description generation (Phase I) and candidate ranking (Phase II), as a strong approach that relies on transfer and contrastive learning. For description generation, T5 and BART outperform other small-scale pre-trained models. By applying contrastive learning to the diverse candidates produced by beam search, the metric fusion-based ranking models significantly outperform the direct description generation models, by up to 22 ROUGE points on the topic-exclusive and topic-independent splits. Furthermore, in human evaluation against the gold descriptions, the Phase II descriptions were chosen in over 45.33% of cases, compared to 23.66% for Phase I. In terms of sentiment analysis, the generated descriptions do not effectively capture all sentiment polarities of the paragraphs, although they capture those of the gold descriptions better. Generating new descriptions automatically reduces the human effort needed to create them and enriches Wikidata-based knowledge graphs. This work has practical value for Wikipedia and Wikidata, where thousands of descriptions are missing. Finally, we expect WikiDes to be a useful dataset for related work on capturing salient information from short paragraphs. The curated dataset is publicly available at: https://github.com/declare-lab/WikiDes.
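To make the idea of metric fusion-based ranking more concrete, the following is a minimal sketch of how beam-search candidates might be ordered by a fused score against a gold description, for instance to build ranking targets for a contrastive ranker. The abstract does not say which metrics are fused or how, so everything here is an assumption for illustration: the simplified n-gram F1 stands in for ROUGE, the unweighted mean in `fused_score` is a hypothetical fusion rule, and the example strings are made up rather than taken from WikiDes.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of n-grams from a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_f1(candidate, reference, n):
    """F1 of n-gram overlap between two strings (a simplified stand-in for ROUGE-n)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    if not cand or not ref:
        return 0.0
    matched = sum((cand & ref).values())  # clipped n-gram matches
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def fused_score(candidate, reference):
    """Hypothetical fusion rule: unweighted mean of unigram and bigram F1."""
    return (overlap_f1(candidate, reference, 1) + overlap_f1(candidate, reference, 2)) / 2


def rank_candidates(candidates, reference):
    """Sort candidates from highest to lowest fused score."""
    return sorted(candidates, key=lambda c: fused_score(c, reference), reverse=True)


# Made-up example data, not taken from WikiDes.
gold = "American jazz pianist"
beams = ["jazz musician", "American jazz pianist and composer", "American pianist"]
ranked = rank_candidates(beams, gold)
for description in ranked:
    print(f"{fused_score(description, gold):.3f}  {description}")
```

At inference time no gold description is available, so a ranking like this could only supply training signal; a learned ranker would then have to score candidates from the paragraph alone.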
