MLLM as a UI Judge: Benchmarking Multimodal LLMs for Predicting Human Perception of User Interfaces

Reuben A. Luera, Ryan Rossi, Franck Dernoncourt, Samyadeep Basu, Sungchul Kim, Subhojyoti Mukherjee, Puneet Mathur, Ruiyi Zhang, Jihyung Kil, Nedim Lipka, Seunghyun Yoon, Jiuxiang Gu, Zichao Wang, Cindy Xiong Bearfield, Branislav Kveton

In an ideal design pipeline, user interface (UI) design is intertwined with user research to validate decisions, yet studies are often resource-constrained during early exploration. Recent advances in multimodal large language models (MLLMs) make them promising early evaluators that can help designers narrow options before formal testing. Unlike prior work that emphasizes user behavior in narrow domains such as e-commerce with metrics like clicks or conversions, we focus on subjective user evaluations across varied interfaces. We investigate whether MLLMs can mimic human preferences when evaluating individual UIs and when comparing them. Using data from a crowdsourcing platform, we benchmark GPT-4o, Claude, and Llama across 30 interfaces and examine alignment with human judgments on multiple UI factors. Our results show that MLLMs approximate human preferences on some dimensions but diverge on others, underscoring both their potential and limitations in supplementing early UX research.
