Modeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages associated with automatically extracted metadata. We analyze the composition of WebUI and show that while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We applied several strategies for incorporating semantics found in web pages to increase the performance of visual UI understanding models in the mobile domain, where less labeled data is available, evaluating on three tasks: (i) element detection, (ii) screen classification, and (iii) screen similarity.