
AAAI 2026

MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models

Conference Paper · AAAI Technical Track on Computer Vision VII · Artificial Intelligence

Abstract

Text-to-video models have demonstrated impressive capabilities in producing diverse video content, yet often lack fine-grained control over motion. We address the problem of motion transfer: given a source video and a target text prompt, generate a new video that preserves the source motion while matching the target semantics and allowing large changes in appearance and scene layout. We introduce MotionFlow, a training-free framework that performs test-time latent optimization guided by attention-derived motion cues. MotionFlow first extracts cross-attention maps from a pre-trained video diffusion model and converts them into spatio-temporal motion masks for the source subject. During generation, it optimizes the target latents so that their evolving attention patterns align with these masks, while the target text controls appearance. This avoids direct attention-map replacement and any model-specific fine-tuning, reducing artifacts and improving flexibility. Qualitative and quantitative experiments, including a user study, show that MotionFlow outperforms existing methods in motion fidelity, temporal consistency, and versatility, even under drastic scene changes.
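The two mechanisms the abstract describes, converting source cross-attention maps into spatio-temporal motion masks and then optimizing the target latents so their attention stays inside those masks, can be sketched concretely. The snippet below is a minimal illustration, not the authors' implementation: the `unet` hook returning attention probabilities, the `(frames, heads, h*w, tokens)` attention layout, and the `guided_step` helper are all assumed for the example.

```python
import torch

def motion_masks_from_attention(cross_attn, subject_token_idx, threshold=0.3):
    """Turn per-frame cross-attention maps for the source subject token
    into binary spatio-temporal motion masks.

    cross_attn: (frames, heads, h*w, tokens) attention probabilities.
    Layout is an assumption for this sketch.
    """
    maps = cross_attn.mean(dim=1)[..., subject_token_idx]   # (frames, h*w)
    maps = maps / (maps.amax(dim=-1, keepdim=True) + 1e-8)  # per-frame normalize
    return (maps > threshold).float()                        # (frames, h*w)

def attention_alignment_loss(target_attn, masks, token_idx):
    """Penalize target-subject attention mass that falls outside the
    source motion masks."""
    maps = target_attn.mean(dim=1)[..., token_idx]
    maps = maps / (maps.sum(dim=-1, keepdim=True) + 1e-8)
    inside = (maps * masks).sum(dim=-1)   # attention mass inside mask, per frame
    return (1.0 - inside).mean()

def guided_step(latents, t, unet, text_emb, masks, token_idx, lr=0.05):
    """One test-time optimization step on the target latents.

    `unet` is a hypothetical callable returning (noise_pred, cross_attn);
    real pipelines would expose attention via processor hooks instead.
    """
    latents = latents.detach().requires_grad_(True)
    noise_pred, attn = unet(latents, t, text_emb)
    loss = attention_alignment_loss(attn, masks, token_idx)
    grad, = torch.autograd.grad(loss, latents)
    latents = (latents - lr * grad).detach()
    # ... then apply the scheduler's usual denoising update with noise_pred.
    return latents
```

Because guidance is applied as a soft loss on the latents rather than by copying attention maps directly, the target prompt remains free to change appearance and layout, which is the flexibility the abstract claims over replacement-based methods.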

Keywords

No keywords are indexed for this paper.

Context

Venue: AAAI Conference on Artificial Intelligence
Archive span: 1980-2026
Indexed papers: 28718
Paper id: 232783464894246052