We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, we construct the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning across textual descriptions, spatial structures, and simulation logic. In parallel, we introduce three evaluation metrics tailored to this task, the Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), which together assess structural integrity, parameter fidelity, and simulator executability. Through systematic ablations over vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems.
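The abstract names the three metrics but does not give their formulas; a minimal sketch, assuming each metric is simply the fraction of generated FlexScript samples that pass the corresponding check, might look like the following. The checker callables and the example stand-ins are placeholders, not the paper's actual parser, parameter comparator, or FlexSim execution harness.

```python
from typing import Callable, Iterable, Dict


def rate(samples: Iterable[str], check: Callable[[str], bool]) -> float:
    """Fraction of generated code samples that pass a given check."""
    samples = list(samples)
    return sum(check(code) for code in samples) / len(samples) if samples else 0.0


def evaluate(samples: Iterable[str],
             is_structurally_valid: Callable[[str], bool],
             parameters_match: Callable[[str], bool],
             executes_successfully: Callable[[str], bool]) -> Dict[str, float]:
    """Corpus-level SVR / PMR / ESR, each assumed to be a pass rate over generated FlexScript."""
    samples = list(samples)
    return {
        "SVR": rate(samples, is_structurally_valid),   # structural integrity
        "PMR": rate(samples, parameters_match),        # parameter fidelity
        "ESR": rate(samples, executes_successfully),   # simulator executability
    }


if __name__ == "__main__":
    # Trivial stand-in checks for illustration only; real checks would call a
    # FlexScript parser, compare extracted parameters against the reference
    # triplet, and run the code inside the FlexSim runtime.
    generated = ["Object source = model();", "createinstance("]
    print(evaluate(
        generated,
        is_structurally_valid=lambda code: code.count("(") == code.count(")"),
        parameters_match=lambda code: True,
        executes_successfully=lambda code: code.rstrip().endswith(";"),
    ))
```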