Video Diffusion Models Encode Motion in Early Timesteps

University of Maryland
Under Review

How temporal (e.g. motion) and spatial (e.g. appearance) information materializes throughout the video diffusion generation process. We find that temporal and spatial attributes are encoded in distinct timestep intervals and are thus inherently decoupled.

Abstract

Text-to-video diffusion models jointly process spatial and temporal information while generating videos, yet the interaction between these dimensions remains underexplored. In this work, we reveal that distinct timestep intervals in the diffusion process specialize in encoding temporal and spatial information. This property is consistent across models with different architectures, suggesting that it is a universal characteristic of video diffusion models. To demonstrate the practical utility of this property, we develop a timestep-constrained LoRA fine-tuning method that achieves precise motion customization without requiring explicit spatial debiasing. Our findings provide novel empirical insights into the internal mechanics of video diffusion models, enabling progress in downstream applications that require motion understanding.

Video DDIM Inversion and Restoration

Decoded latents at various points in the diffusion process when applying DDIM inversion to an existing video, starting from t=0. Up to t=600, the original motion of the video remains intact, while the spatial information is gradually removed.
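
A minimal PyTorch sketch of this DDIM inversion loop is shown below; `eps_model(x, t)` and `alphas_cumprod` are assumed stand-ins for the base model's noise-prediction network (with a fixed prompt) and its scheduler's cumulative alphas, not an exact reproduction of our implementation.

```python
import torch

@torch.no_grad()
def ddim_invert(latents, eps_model, alphas_cumprod, num_steps=50, t_max=1000):
    """Deterministically map clean video latents (t=0) toward noise (t=t_max)
    with the standard DDIM inversion update, keeping intermediate latents so
    they can be decoded and inspected at different timesteps."""
    timesteps = torch.linspace(0, t_max - 1, num_steps).long()
    x = latents
    trajectory = {0: x.clone()}
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur)                                  # predicted noise at the current step
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()    # implied clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps    # deterministic step toward higher noise
        trajectory[int(t_next)] = x.clone()                        # e.g. decode trajectory[600] with the VAE
    return trajectory
```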


Results of resampling the DDIM-inverted latents of the existing video with a new prompt. In this case, the original prompt is "a monkey walking" and the new prompt is "a cat walking". When sampling unconditionally, we expect to retain the original video attributes; sampling with the new prompt will edit the generation result.

We find that applying the new prompt from t=600 to t=0 changes the appearance to a cat while preserving the original video's motion. In other words, motion information is localized within t=1000 to t=600, independently of spatial information.
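
A minimal sketch of this prompt-switching resampling, under the same assumptions as above plus a generic conditional noise predictor `eps_model(x, t, cond)` and precomputed prompt embeddings `old_cond` / `new_cond`; the switch point `t_switch=600` corresponds to the boundary we observe between motion and appearance encoding.

```python
import torch

@torch.no_grad()
def prompt_switch_resample(x_T, eps_model, alphas_cumprod, old_cond, new_cond,
                           t_switch=600, num_steps=50, t_max=1000):
    """Deterministic DDIM sampling from the inverted latent x_T at t=t_max down to t=0,
    switching the conditioning to the new prompt once the timestep drops below t_switch
    so that only appearance (encoded at t < t_switch) is edited."""
    timesteps = torch.linspace(t_max - 1, 0, num_steps).long()
    x = x_T
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        cond = old_cond if t_cur >= t_switch else new_cond        # swap the prompt at t_switch
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = eps_model(x, t_cur, cond)                           # conditional noise prediction
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps   # standard DDIM denoising step
    return x                                                      # decode with the VAE to get the edited video
```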

Application: Motion Customization

Using our findings, we can efficiently fine-tune a diffusion model to replicate a motion by training only on timesteps from t=1000 to t=600. Because this interval is decoupled from the timesteps responsible for spatial information, we can transfer the reference motion to a new video without leakage of spatial attributes. Unlike concurrent methods, our approach does not require any additional spatial-debiasing modules or loss functions.
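
A minimal sketch of one such timestep-constrained training step is given below; `unet_lora` is assumed to be the base video model with LoRA adapters injected (only those parameters trainable), `vae_latents` the encoded reference-video latents of shape (B, C, F, H, W), and `alphas_cumprod` the scheduler's cumulative alphas.

```python
import torch
import torch.nn.functional as F

def motion_lora_training_step(unet_lora, vae_latents, text_emb, alphas_cumprod,
                              t_low=600, t_high=1000):
    """One denoising training step restricted to the motion-encoding interval
    [t_low, t_high); everything else follows the standard epsilon-prediction loss."""
    b = vae_latents.shape[0]
    t = torch.randint(t_low, t_high, (b,), device=vae_latents.device)          # constrained timestep sampling
    noise = torch.randn_like(vae_latents)
    a_bar = alphas_cumprod.to(vae_latents.device)[t].view(b, 1, 1, 1, 1)        # broadcast over (B, C, F, H, W)
    noisy = a_bar.sqrt() * vae_latents + (1 - a_bar).sqrt() * noise             # forward diffusion to timestep t
    pred = unet_lora(noisy, t, text_emb)                                        # noise prediction with LoRA adapters
    return F.mse_loss(pred, noise)                                              # backprop updates only LoRA weights
```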

Our method is versatile across different fine-tuning methods and base models. The first three examples use the ModelScope model with LoRA, the fourth uses the Latte model with LoRA, and the fifth uses ModelScope with direct fine-tuning.

BibTeX

Coming soon!