Mobile User Interface Summarization generates succinct language descriptions of mobile screens, conveying the important content and functionality of a screen, which can be useful for many language-based application scenarios. We present Screen2Words, a novel screen summarization approach that automatically encapsulates the essential information of a UI screen into a coherent language phrase. Summarizing mobile screens requires a holistic understanding of the multi-modal data of mobile UIs, including text, image, and structure as well as UI semantics, motivating our multi-modal learning approach. We collected and analyzed a large-scale screen summarization dataset annotated by human workers. Our dataset contains more than 112k language summaries across $\sim$22k unique UI screens. We then experimented with a set of deep models with different configurations. Our evaluation of these models, with both automatic accuracy metrics and human ratings, shows that our approach can generate high-quality summaries for mobile screens. We demonstrate potential use cases of Screen2Words and open-source our dataset and model to lay the foundations for further bridging language and user interfaces.