The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks spanning 100 real-world websites and 1,001 user actions, featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, with functional correctness (24.39% IFS) lagging far behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/SIGMME/IWR-Bench.