Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, yielding substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is publicly available at https://github.com/leigest519/ScreenCoder.
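To make the grounding-planning-generation decomposition concrete, the following is a minimal, hypothetical sketch of such a pipeline. All function names, data shapes, and the emitted markup are illustrative assumptions for exposition, not ScreenCoder's actual interfaces.

```python
# Hypothetical sketch of a grounding -> planning -> generation agent pipeline.
# Names and data formats are assumptions, not the framework's real API.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class UIElement:
    label: str                        # e.g. "header", "sidebar", "button"
    bbox: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates


def grounding_agent(screenshot_path: str) -> List[UIElement]:
    """Visual perception: detect and localize major UI regions in the screenshot."""
    # Placeholder output; a real agent would query an MLLM or detector here.
    return [UIElement("header", (0, 0, 1280, 80)),
            UIElement("sidebar", (0, 80, 240, 800))]


def planning_agent(elements: List[UIElement]) -> Dict:
    """Layout planning: arrange detected elements into a hierarchical layout tree."""
    return {"type": "page",
            "children": [{"type": e.label, "bbox": e.bbox} for e in elements]}


def generation_agent(plan: Dict) -> str:
    """Code synthesis: emit front-end markup from the layout plan."""
    body = "\n".join(f'  <div class="{c["type"]}"></div>' for c in plan["children"])
    return f"<body>\n{body}\n</body>"


def design_to_code(screenshot_path: str) -> str:
    """Full pipeline: each stage is a separate, inspectable agent."""
    return generation_agent(planning_agent(grounding_agent(screenshot_path)))


print(design_to_code("dashboard.png"))
```

Because each stage has an explicit intermediate output (detected elements, then a layout tree), perception and planning errors can be inspected and corrected in isolation rather than being entangled in a single end-to-end generation step.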