Models designed for intelligent process automation must be able to ground user interface elements. This UI element grounding task centres on linking natural language instructions to their target referents. Although BERT and similar pre-trained language models have excelled in several NLP tasks, their use in the UI grounding domain remains largely unexplored. This work concentrates on testing and probing the grounding abilities of three transformer-based models: BERT, RoBERTa and LayoutLM. Our primary focus is on these models' spatial reasoning skills, given their importance in this domain. We observe that LayoutLM, although originally created for a different purpose (representing scanned documents), holds a promising advantage for applications in this domain: its learned spatial features appear to transfer to the UI grounding setting, in particular demonstrating the ability to discriminate between target directions in natural language instructions.
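As a concrete illustration of how UI elements can be prepared for a LayoutLM-style model, the sketch below normalizes pixel-space bounding boxes into the 0–1000 coordinate grid that LayoutLM expects as positional input. The screen dimensions and the "Submit" button are hypothetical examples, not data from this work.

```python
def normalize_box(box, width, height):
    """Scale a pixel-space bounding box (x0, y0, x1, y1) into
    LayoutLM's 0-1000 normalized coordinate space."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    )

# Hypothetical example: a 1920x1080 screenshot containing a "Submit"
# button at pixels (860, 900) to (1060, 960).
button_box = normalize_box((860, 900, 1060, 960), 1920, 1080)
print(button_box)  # normalized (x0, y0, x1, y1) in the 0-1000 grid
```

Each UI element's normalized box would then accompany its text tokens as model input, letting the model attend jointly to the instruction and the elements' on-screen positions.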