Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces often best reflect an app's full functionality. We trained a robust, fast, memory-efficient, on-device model to detect UI elements using a dataset of 77,637 screens (from 4,068 iPhone apps) that we collected and annotated. To further improve UI detections and add semantic information, we introduced heuristics (e.g., UI grouping and ordering) and additional models (e.g., to recognize UI content, state, and interactivity). We built Screen Recognition to generate accessibility metadata to augment iOS VoiceOver. In a study with 9 screen reader users, we validated that our approach improves the accessibility of existing mobile apps, enabling even previously inaccessible apps to be used.