Synthesizing realistic human-object interactions (HOI) in video is challenging due to the complex, instance-specific interaction dynamics of both humans and objects. Incorporating controllability into video generation adds further complexity. Existing controllable video generation approaches face a trade-off: sparse controls such as keypoint trajectories are easy to specify but lack instance-awareness, while dense signals such as optical flow, depth maps, or 3D meshes are informative but costly to obtain. We propose VHOI, a two-stage framework that first densifies sparse trajectories into HOI mask sequences and then fine-tunes a video diffusion model conditioned on these dense masks. We introduce a novel HOI-aware motion representation that uses color encodings to distinguish not only human and object motion but also body-part-specific dynamics. This design incorporates a human prior into the conditioning signal and strengthens the model's ability to understand and generate realistic HOI dynamics. Experiments demonstrate state-of-the-art results in controllable HOI video generation. VHOI is not limited to interaction-only scenarios and can also generate full human navigation leading up to object interactions in an end-to-end manner. Project page: https://vcai.mpi-inf.mpg.de/projects/vhoi/.