Multimodal summarisation with multimodal output is drawing increasing attention due to the rapid growth of multimedia data. Several methods have been proposed to summarise visual-text content, but their multimodal outputs are not succinct enough to address information overload. To this end, we introduce a new task, eXtreme Multimodal Summarisation with Multimodal Output (XMSMO), for the TL;DW (Too Long; Didn't Watch) scenario, akin to TL;DR. XMSMO aims to summarise a video-document pair into an extremely short summary consisting of one cover frame as the visual summary and one sentence as the textual summary. We propose a novel unsupervised Hierarchical Optimal Transport Network (HOT-Net) with three components: hierarchical multimodal encoders, hierarchical multimodal fusion decoders, and optimal transport solvers. Our method is trained without reference summaries by optimising visual and textual coverage, measured as the distance between semantic distributions under optimal transport plans. To facilitate study of this task, we collect a large-scale dataset, XMSMO-News, by harvesting 4,891 video-document pairs. Experimental results show that our method achieves promising performance in terms of ROUGE and IoU metrics.
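To make the coverage objective more concrete, the following is a minimal sketch of how a distance between two semantic distributions can be computed under an approximate optimal transport plan. It is not the HOT-Net implementation: the abstract does not say which solver is used, so entropy-regularised Sinkhorn iterations are swapped in here as one common choice. The helper names `cosine_cost` and `sinkhorn_distance`, the random embeddings, the uniform weights, and the values of `reg` and `n_iters` are all assumptions made for illustration.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x (n, d) and rows of y (m, d)."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_distance(a, b, cost, reg=0.1, n_iters=200):
    """Approximate OT cost between histograms a (n,) and b (m,).

    Uses entropy-regularised Sinkhorn iterations (an assumed solver,
    not necessarily the one used in the paper).
    Returns the transport cost and the transport plan of shape (n, m).
    """
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost)), plan


# Hypothetical data: 6 document-sentence embeddings and 3 candidate
# one-sentence summaries, each represented by 4 token embeddings.
rng = np.random.default_rng(0)
doc = rng.normal(size=(6, 8))
candidates = [rng.normal(size=(4, 8)) for _ in range(3)]

doc_weights = np.full(6, 1.0 / 6.0)   # uniform mass over document sentences
cand_weights = np.full(4, 1.0 / 4.0)  # uniform mass over candidate tokens

# Score each candidate by its transport cost to the document distribution;
# a lower cost is read here as better textual coverage.
scores = [
    sinkhorn_distance(doc_weights, cand_weights, cosine_cost(doc, c))[0]
    for c in candidates
]
best = int(np.argmin(scores))
```

In this sketch a lower transport cost means the candidate's semantic distribution sits closer to the document's, which is one way to read "optimising coverage" without reference summaries. The same idea could be applied to frame embeddings when choosing a cover frame, again as an assumption rather than a description of the paper's architecture.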