DynVFX: Augmenting Real Videos with Dynamic Content

Supplementary Material

 




 


Our Results

We present sample results of our method, some of which are shown in Fig. 4 of the paper.


Original video A majestic whale in the background
Original video A puppy peeking its head out of the box
Original video A majestic knight rides the horse
Original video A massive tsunami flooding the city, water everywhere, apocalyptic
Original video A playful golden retriever swimming trying to catch up with the joggers
Original video dinosaurs eating leaves from trees, taking big bites
Original video massive explosion from the mountain
Original video Many workers in the rice field, cutting crops
Original video A group of jellyfish floating
Original video A giraffe walking
Original video An elephant walking
Original video Balloons bouncing and swaying, surrounding the woman. Many, colorful.
Original video A puppy running beside the woman

 


Comparisons to Baselines

Existing methods are incapable of maintaining scene fidelity while introducing new content. Our method successfully adds new content to the scene, achieving high fidelity to the user instructions while allowing natural, realistic, and seamless interactions between the original and added elements.


Edit prompt: "A bear dancing with the woman"
Panels: Ours, Finetuning, SDEdit, Gen-3, DDIM Inversion

 


Edit prompt: "A fire breathing dragon chasing the dog"
Panels: Ours, Finetuning, SDEdit, Gen-3, DDIM Inversion

 


Edit prompt: "A pair of deer drinking water from the creek"
Panels: Ours, Finetuning, SDEdit, Gen-3, DDIM Inversion

 


Ablations

Ablations. (b) Excluding both AnchorExtAttn and iterative refinement causes significant misalignment with the original scene and poor harmonization (e.g., the size of the puppy relative to the scene, and boundary artifacts). (c) Omitting AnchorExtAttn leads to incorrect positioning of the new content. (d) Removing iterative refinement results in poor harmonization. (e) Our full method exhibits good localization and harmonization of the edit.


Edit prompt: "A puppy peeking its head out of the box"
Panels: w/o AnchorExtAttn and Iter. Refin.; w/o AnchorExtAttn; w/o Iter. Refin.; Our method

 


Edit prompt: "A massive tsunami flooding the city"
Panels: w/o AnchorExtAttn and Iter. Refin.; w/o AnchorExtAttn; w/o Iter. Refin.; Our method

 


Extended Attention Analysis

Controlling fidelity to the original scene using different extended attention mechanisms.

(a-b) SDEdit suffers from a trade-off between preserving the original scene and fidelity to the edit. (c-e) Three Extended Attention variants applied during sampling offer different levels of control: Full Extended Attention closely reconstructs the input scene; Masked Extended Attention allows new content to emerge but is too constrained in overlapping regions; and our Anchor Ext. Attn. gives the best result of the three by applying dropout, extending attention only at sparse points within selected regions.
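The difference between the three variants can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes a single attention head over flattened tokens, that "extending" attention means appending keys and values from the original video to those of the edited stream, and that the selected regions are given as a boolean token mask. The function names, the `keep_prob` parameter, and the mask layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def select_extension(n_tokens, mode, region_mask=None, keep_prob=0.1, rng=None):
    """Mark which original-video tokens are appended as extra keys/values.

    mode="full":   every original token is visible (assumed reading of Full Ext. Att.).
    mode="masked": only tokens inside region_mask are visible.
    mode="anchor": tokens inside region_mask survive random dropout with
                   probability keep_prob, leaving sparse anchor points.
    """
    if mode == "full":
        return np.ones(n_tokens, dtype=bool)
    if region_mask is None:
        raise ValueError("region_mask is required for 'masked' and 'anchor'")
    region_mask = np.asarray(region_mask, dtype=bool)
    if mode == "masked":
        return region_mask.copy()
    if mode == "anchor":
        rng = np.random.default_rng() if rng is None else rng
        keep = rng.random(n_tokens) < keep_prob  # hypothetical dropout rate
        return region_mask & keep
    raise ValueError(f"unknown mode: {mode}")


def extended_attention(q, k, v, k_orig, v_orig, extend):
    """Self-attention whose keys/values are extended with selected original tokens.

    q, k, v:        (N, d) arrays from the stream being edited.
    k_orig, v_orig: (M, d) arrays from the original video.
    extend:         (M,) boolean vector from select_extension.
    """
    k_ext = np.concatenate([k, k_orig[extend]], axis=0)
    v_ext = np.concatenate([v, v_orig[extend]], axis=0)
    scores = q @ k_ext.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v_ext
```

In this sketch the only thing that changes between variants is how many original tokens the edited stream can attend to: all of them, all inside the mask, or a sparse random subset inside the mask. With an empty selection the function reduces to ordinary self-attention.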


Edit prompt: "A knight riding a horse"
Panels: Sampling T=0.9, Sampling T=0.6, Our method, Full Ext. Att., Masked Ext. Att., Anchor Ext. Att.

 


Limitations

In some cases, the T2V diffusion model struggles to precisely follow the edit prompt.


Original video A fish bowl encircling the boy's head, in bowl fish goldfish swimming around the boy's head