To improve the accessibility of smart devices and to simplify their usage, it is critical to build models that understand user interfaces (UIs) and assist users in completing their tasks. However, UI-specific characteristics pose unique challenges, such as how to effectively leverage multimodal UI features involving image, text, and structural metadata, and how to achieve good performance when high-quality labeled data is unavailable. To address these challenges, we introduce UIBert, a transformer-based joint image-text model trained through novel pre-training tasks on large-scale unlabeled UI data to learn generic feature representations for a UI and its components. Our key intuition is that the heterogeneous features in a UI are self-aligned, i.e., the image and text features of UI components are predictive of each other. We propose five pre-training tasks that exploit this self-alignment among different features of a UI component and across various components in the same UI. We evaluate our method on nine real-world downstream UI tasks, where UIBert outperforms strong multimodal baselines by up to 9.26% accuracy.
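To make the self-alignment intuition concrete, the sketch below shows one plausible way to encourage image and text embeddings of the same UI component to be predictive of each other, via a symmetric contrastive matching loss. This is a minimal illustration under our own assumptions, not the paper's exact pre-training objectives; the function name, `d_model`, and `temperature` are hypothetical.

```python
# Minimal sketch (assumed, not the paper's exact objective) of self-alignment:
# the image embedding and text embedding of the same UI component should
# identify each other among all components in a batch.
import torch
import torch.nn.functional as F

def self_alignment_loss(image_emb: torch.Tensor,
                        text_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (num_components, d_model) embeddings of the same
    UI components produced by the image and text branches of the encoder."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = img @ txt.t() / temperature          # pairwise similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # each component's image embedding should match its own text embedding,
    # and vice versa
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```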