In zero-shot multilingual extractive text summarization, a model is typically trained on an English summarization dataset and then applied to summarization datasets in other languages. Given English gold summaries and documents, sentence-level labels for extractive summarization are usually generated with heuristics. However, these monolingual labels created on English datasets may be suboptimal for datasets in other languages because of syntactic and semantic discrepancies across languages. One remedy is to translate the English dataset into other languages and obtain different sets of labels, again using heuristics. To fully leverage the information in these different label sets, we propose NLSSum (Neural Label Search for Summarization), which jointly learns hierarchical weights for the different label sets together with our summarization model. We conduct multilingual zero-shot summarization experiments on the MLSUM and WikiLingua datasets and achieve state-of-the-art results under both human and automatic evaluation across these two datasets.
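The heuristic label generation mentioned above is commonly a greedy ROUGE-based oracle: sentences are added one at a time whenever they improve overlap with the gold summary. A minimal sketch of that idea follows; the function names, the whitespace tokenization, and the simple ROUGE-1 F1 here are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    """Unigram-overlap F1 between a candidate and a reference token list."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def greedy_oracle_labels(doc_sentences, gold_summary, max_sentences=3):
    """Greedily select sentences that increase ROUGE-1 F1 against the gold
    summary; selected sentences get label 1, all others label 0."""
    ref_tokens = gold_summary.lower().split()
    sent_tokens = [s.lower().split() for s in doc_sentences]
    selected = []
    while len(selected) < max_sentences:
        current = [t for i in selected for t in sent_tokens[i]]
        base = rouge1_f(current, ref_tokens)
        best_gain, best_i = 0.0, None
        for i, toks in enumerate(sent_tokens):
            if i in selected:
                continue
            gain = rouge1_f(current + toks, ref_tokens) - base
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:  # no sentence improves the score; stop early
            break
        selected.append(best_i)
    return [1 if i in selected else 0 for i in range(len(doc_sentences))]
```

Because these labels are tied to English surface forms, running the same heuristic on a translated version of the dataset can yield a different label set, which is exactly the disagreement NLSSum exploits.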