3D human reaction generation faces three main challenges: (1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. During training, we introduce Bootstrap Contextual Encoding (BSCE), which encodes generated history instead of ground-truth history, to alleviate error accumulation in autoregressive generation. We further introduce an offline variant, ReMFlow, which achieves state-of-the-art performance with the fastest inference among offline methods. ARMFlow addresses key limitations of online settings by (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency with single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
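To make the single-step autoregressive idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard MeanFlow one-step sampling rule, z_0 = z_1 - u(z_1, r=0, t=1), with noise at t=1. The names `encode_context`, `mean_velocity`, and `rollout`, the feature sizes, and the random untrained weights are all hypothetical stand-ins for the causal context encoder and the MLP-based velocity predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MOTION = 6   # per-frame pose feature size (toy value, assumption)
D_CTX = 16     # context embedding size (toy value, assumption)

# Random, untrained weights standing in for learned parameters.
W_ctx = rng.normal(scale=0.1, size=(2 * D_MOTION, D_CTX))
W_u = rng.normal(scale=0.1, size=(D_MOTION + D_CTX + 2, D_MOTION))


def encode_context(actor_hist, reactor_hist):
    """Hypothetical causal encoder: sees only frames before the current step."""
    pairs = np.concatenate([actor_hist, reactor_hist], axis=1)  # (k, 2*D_MOTION)
    return np.tanh(pairs @ W_ctx).mean(axis=0)                  # (D_CTX,)


def mean_velocity(z, r, t, ctx):
    """Hypothetical stand-in for the predictor of average velocity u(z, r, t | ctx)."""
    inp = np.concatenate([z, ctx, [r, t]])
    return np.tanh(inp @ W_u)


def rollout(actor, reactor_init, noise_rng):
    """Generate one reactor frame per actor frame, with one predictor call each."""
    reactor = [reactor_init]
    for k in range(1, len(actor)):
        # The history holds previously *generated* reactor frames,
        # which is what the model sees at inference time.
        ctx = encode_context(actor[:k], np.stack(reactor))
        eps = noise_rng.normal(size=D_MOTION)
        # One-step MeanFlow sampling: z_0 = z_1 - u(z_1, r=0, t=1).
        reactor.append(eps - mean_velocity(eps, 0.0, 1.0, ctx))
    return np.stack(reactor)


actor = np.zeros((5, D_MOTION))
out = rollout(actor, np.zeros(D_MOTION), np.random.default_rng(1))
print(out.shape)
```

Under this reading, BSCE would correspond to building the training-time context from outputs of such a rollout rather than from ground-truth reactor frames, so that the encoder is trained on the same kind of history it encounters during online generation. How the paper actually schedules or mixes generated and ground-truth history is not stated in the abstract.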