Motion-I2V

Abstract

We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module effectively propagates the reference image's features to the synthesized frames, guided by the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more control over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation.

Method

The first stage of Motion-I2V aims to deduce motions that can plausibly animate the reference image. It is conditioned on the reference image and a text prompt, and predicts the motion field maps between the reference frame and all future frames. The second stage propagates the reference image's content to the synthesized frames. A novel motion-augmented temporal layer enhances 1-D temporal attention with warped features. This operation enlarges the temporal receptive field and reduces the difficulty of directly learning complicated spatial-temporal patterns.
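
The idea of enhancing 1-D temporal attention with warped features can be illustrated with a minimal NumPy sketch. It is not the Motion-I2V implementation. The function names, the tensor layouts, the backward-warp flow convention, nearest-neighbour sampling, and the omission of learned query/key/value projections are all assumptions made for illustration. For each frame, the reference features are warped with that frame's motion field, and the warped feature is then added as an extra key/value token to the per-pixel attention over time.

```python
import numpy as np

def warp_reference(ref_feat, flow):
    """Warp ref_feat (H, W, C) with flow (H, W, 2), nearest-neighbour sampling.

    Assumption: flow[y, x] = (dx, dy) points from location (x, y) in the
    target frame to the matching location in the reference frame.
    """
    h, w, _ = ref_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Round to the nearest source pixel and clamp to the image border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return ref_feat[src_y, src_x]

def motion_augmented_temporal_attention(frame_feats, ref_feat, flows):
    """Hypothetical sketch of temporal attention with one extra warped token.

    frame_feats: (T, H, W, C) features of the frames being synthesized
    ref_feat:    (H, W, C)    features of the reference image
    flows:       (T, H, W, 2) one motion field per frame
    """
    t, h, w, c = frame_feats.shape
    warped = np.stack([warp_reference(ref_feat, flows[i]) for i in range(t)])

    # Plain 1-D temporal attention: each pixel attends over its T time steps.
    q = frame_feats.transpose(1, 2, 0, 3)                      # (H, W, T, C)
    temporal = np.broadcast_to(q[:, :, None], (h, w, t, t, c))

    # Augmentation: prepend the warped reference feature for each query frame.
    extra = warped.transpose(1, 2, 0, 3)[:, :, :, None, :]     # (H, W, T, 1, C)
    kv = np.concatenate([extra, temporal], axis=3)             # (H, W, T, T+1, C)

    # Scaled dot-product attention without learned projections (simplification).
    scores = np.einsum("hwtc,hwtkc->hwtk", q, kv) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    out = np.einsum("hwtk,hwtkc->hwtc", attn, kv)
    return out.transpose(2, 0, 1, 3)                           # (T, H, W, C)
```

Because the warped token is fetched from wherever the motion field points, a pixel can draw on reference content from a different spatial location, which plain per-pixel temporal attention cannot reach. A real layer would likely use bilinear sampling and learned projections.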

Comparisons

"a fast driving tank"	Ours	Pika	Gen-2	DynamiCrafter

"zoom out view, landscape"

"a fast driving blue BMW"

"three clear ice cubes"

"a crawling snail"

Image-to-Video Demos

A portrait of a woman with a warm smile.	A woman blowing snow from her hands on a sunny winter day.	Chef Preparing Gourmet Dishes.	Stylish Skeleton with Afro and Sunglasses.
Chefs preparing food in a commercial kitchen.	A joyful mother lifting her baby in the air.	Burning Hundred Dollar Bill.	Pouring white wine from a clear glass bottle into a wineglass, with a blurry background.
Two clear ice cubes on a dark surface with water droplets.	A snowman in a forest caught in a burst of flames, with a backdrop of snow-covered trees.	A skilled barista pours steaming water into a series of glasses.	Birthday cupcakes with lit candles.
Tropical sunset view with silhouettes of palm trees and a reflective pool.	Fresh Raindrops on Vibrant Green Leaf.	Industrial Smokestacks Emitting Smoke into the Sky.	A vintage yellow biplane flying over a canyon with a river.
Sunset over misty mountains with clouds in the sky.	A sailboat on a clear day with a mountainous coastline in the distance.	A red fox standing in the snow.	A happy corgi dog with a reddish-brown and white coat sitting on grass, tongue out, looking cheerful.
A squirrel eating a nut on a wooden fence.	Contemplative Dog by the Waterside at Sunset.	Alert Meerkat Standing on a Log.	Pensive Monkey Seated.