Recent months have seen a leap forward as denoising diffusion models were introduced to motion generation. Yet the main gap in this field remains the low availability of data. Furthermore, the expensive acquisition process of motion biases the already modest data towards short single-person sequences. With such a shortage, more elaborate generative tasks are left behind. In this paper, we show that this gap can be mitigated by using a pre-trained diffusion-based model as a generative prior. We demonstrate that the prior is effective in fine-tuning, few-shot, and even zero-shot settings. For the zero-shot setting, we tackle the challenge of long sequence generation. We introduce DoubleTake, an inference-time method with which we demonstrate animations of up to 10 minutes, composed of prompted intervals and meaningful, controlled transitions between them, using a prior that was trained for 10-second generations. For the few-shot setting, we consider two-person generation. Using two fixed priors and as few as a dozen training examples, we learn a slim communication block, ComMDM, to infuse interaction between the two resulting motions. Finally, using fine-tuning, we train the prior to semantically complete motions from a single prescribed joint. Then, we use our DiffusionBlending to blend a few such models into a single one that responds well to the combination of the individual control signals, enabling fine-grained joint- and trajectory-level control and editing. Using an off-the-shelf state-of-the-art (SOTA) motion diffusion model as a prior, we evaluate our approach on the three cases above and show that we consistently outperform SOTA models that were designed and trained for those tasks.