Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this task involves document summarization, image and text retrieval, and slide structure and layout prediction to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle the task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset of about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.