MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent

We propose MotionAgent, which enables fine-grained motion control for text-guided image-to-video generation. The key technique is a motion field agent that converts the motion information in text prompts into explicit motion fields, providing flexible and precise motion guidance. Specifically, the agent extracts the object movement and camera motion described in the text and converts them into object trajectories and camera extrinsics, respectively. An analytical optical flow composition module integrates these motion representations in 3D space and projects them into a unified optical flow. An optical flow adapter then uses this flow to control the base image-to-video diffusion model, generating videos with fine-grained motion control. A significant improvement on the Video-Text Camera Motion metrics on VBench indicates that our method achieves precise control over camera motion. We also construct a subset of VBench to evaluate how well the motion in the generated video aligns with the motion described in the text; on this subset, our method outperforms other advanced models in motion generation accuracy.
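
The composition step described in the abstract (lifting motion into 3D, then projecting it to a single optical flow) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a pinhole camera, a known per-pixel depth map, a single rigid camera move, and an object motion given as a constant 3D shift over a mask. The function name `compose_flow` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def compose_flow(depth, K, R, t, obj_mask=None, obj_shift=None):
    """Illustrative sketch: project camera motion plus an optional 3D
    object shift into one 2D optical flow field.

    depth:     (H, W) depth of each pixel in the first frame's camera coords
    K:         (3, 3) pinhole intrinsics (assumed known)
    R, t:      (3, 3) rotation and (3,) translation mapping first-frame
               camera coordinates to second-frame camera coordinates
    obj_mask:  optional (H, W) boolean mask of the moving object
    obj_shift: optional (3,) object displacement in first-frame camera coords
    Returns an (H, W, 2) array of per-pixel (dx, dy) displacements.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float64),
                       np.arange(H, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels

    # Unproject every pixel to a 3D point: X = depth * K^{-1} [u, v, 1]^T
    pts = (pix @ np.linalg.inv(K).T) * depth[..., None]

    # Apply the object motion in 3D, only where the mask is set
    if obj_mask is not None and obj_shift is not None:
        pts = pts + obj_mask[..., None] * np.asarray(obj_shift, dtype=np.float64)

    # Apply the camera motion, then project back to the image plane
    pts_new = pts @ R.T + t
    proj = pts_new @ K.T
    uv_new = proj[..., :2] / proj[..., 2:3]

    # Flow is the displacement of each pixel between the two frames
    return uv_new - pix[..., :2]


# Example: a flat scene 2 units away, with the scene points shifted
# 0.1 units along +x in the second camera's frame.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
depth = np.full((64, 64), 2.0)
flow = compose_flow(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

In this example every pixel moves by focal length × shift / depth = 100 × 0.1 / 2 = 5 pixels horizontally. Because both motions are combined in 3D before projection, the object and camera contributions end up in a single consistent flow field rather than two separately estimated 2D flows.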
