Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control - specifying object dynamics and camera trajectories - is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, a framework that leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL) inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural-language descriptions with corresponding motion programs and 3D trajectories. Experiments demonstrate that LAMP improves motion controllability and alignment with user intent over state-of-the-art alternatives, establishing the first framework that generates both object and camera motions directly from natural language specifications. Code, models, and data are available on our project page.
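To make the deterministic mapping from motion programs to 3D trajectories concrete, the sketch below shows one possible shape such a pipeline could take. It is a minimal illustration, not the paper's actual DSL: the command names (DollyIn, OrbitLeft), their parameters, the default camera and subject positions, and the interpreter are all hypothetical.

```python
import math
from dataclasses import dataclass

# Hypothetical DSL commands, loosely named after cinematography moves.
@dataclass
class DollyIn:          # move the camera toward the subject along the view axis
    distance: float     # world units
    frames: int

@dataclass
class OrbitLeft:        # circle the camera around the subject at a fixed radius
    degrees: float
    frames: int

def interpret(program, start=(0.0, 1.5, 5.0), subject=(0.0, 1.0, 0.0)):
    """Deterministically expand a motion program into per-frame 3D camera positions."""
    x, y, z = start
    sx, sy, sz = subject
    trajectory = [(x, y, z)]
    for cmd in program:
        if isinstance(cmd, DollyIn):
            # step along the camera-to-subject direction
            dx, dy, dz = sx - x, sy - y, sz - z
            norm = math.sqrt(dx * dx + dy * dy + dz * dz)
            step = cmd.distance / cmd.frames
            for _ in range(cmd.frames):
                x += dx / norm * step
                y += dy / norm * step
                z += dz / norm * step
                trajectory.append((x, y, z))
        elif isinstance(cmd, OrbitLeft):
            # rotate around the subject in the horizontal (XZ) plane
            radius = math.hypot(x - sx, z - sz)
            angle = math.atan2(z - sz, x - sx)
            step = math.radians(cmd.degrees) / cmd.frames
            for _ in range(cmd.frames):
                angle += step
                x = sx + radius * math.cos(angle)
                z = sz + radius * math.sin(angle)
                trajectory.append((x, y, z))
    return trajectory

# Example: "slowly push in on the subject, then circle a quarter turn to the left"
program = [DollyIn(distance=2.0, frames=48), OrbitLeft(degrees=90.0, frames=72)]
waypoints = interpret(program)
print(len(waypoints), waypoints[-1])
```

In this illustrative setup, the LLM's only job would be to emit the structured program from the user's sentence; the interpreter then produces the same trajectory every time, which is what makes the mapping deterministic.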