Natural language descriptions of user interface (UI) elements, such as alternative text, are crucial for accessibility and language-based interaction in general. Yet, these descriptions are frequently missing in mobile UIs. We propose widget captioning, a novel task for automatically generating language descriptions for UI elements from multimodal input, including both the image and the structural representation of the user interface. We collected a large-scale dataset for widget captioning via crowdsourcing. Our dataset contains 162,859 language phrases created by human workers to annotate 61,285 UI elements across 21,750 unique UI screens. We thoroughly analyze the dataset, and train and evaluate a set of deep model configurations to investigate how each feature modality and the choice of learning strategies impact the quality of predicted captions. The task formulation, the dataset, and our benchmark models provide a solid basis for this novel multimodal captioning task that connects language and user interfaces.