ActionBert: Leveraging User Actions for Semantic Understanding of User Interfaces

As mobile devices become ubiquitous, regularly interacting with a variety of user interfaces (UIs) is a common aspect of daily life for many people. To improve the accessibility of these devices and to enable their usage in a variety of settings, building models that can assist users and accomplish tasks through the UI is vitally important. However, achieving this poses several challenges. First, UI components of similar appearance can have different functionalities, so understanding their function matters more than analyzing their appearance alone. Second, domain-specific features such as the Document Object Model (DOM) in web pages and the View Hierarchy (VH) in mobile applications provide important signals about the semantics of UI elements, but these features are not in a natural language format. Third, owing to the large diversity of UIs and the absence of standard DOM or VH representations, building a UI understanding model with high coverage requires large amounts of training data. Inspired by the success of pre-training based approaches in NLP for tackling a variety of problems in a data-efficient way, we introduce a new pre-trained UI representation model called ActionBert. Our methodology is designed to leverage visual, linguistic and domain-specific features in user interaction traces to pre-train generic feature representations of UIs and their components. Our key intuition is that user actions, e.g., a sequence of clicks on different UI components, reveal important information about the functionality of those components. We evaluate the proposed model on a wide variety of downstream tasks, ranging from icon classification to UI component retrieval based on its natural language description. Experiments show that the proposed ActionBert model outperforms multi-modal baselines across all downstream tasks by up to 15.5%.
